Stat 5969 Statistical Software Packages
Analysis Tools

Data Analysis Tools

• This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool loaded, you need to go to Tools/Add-Ins and then check the box Analysis ToolPak. When you do so, you may be prompted to enter your original CD to load the tools.

Tools for Summarizing Data

• There are two principal analysis tools for summarizing data. They are “Histogram” and “Descriptive Statistics.”

Histograms

• We can use a spreadsheet to obtain a histogram. In the process it finds the frequency distribution and then it will draw the plot. It also has the option of finding an ogive. Below is the procedure.

1. To get to the Analysis Tools, select Tools/Data Analysis. This will bring up the list of statistical methods.
2. Select the tool entitled "Histogram." The dialog box below will then appear. All of the analysis tools in Excel provide a similar dialog box.
3. In the dialog box, specify where the data you want to analyze are located and where you want the output to go. Specify the location of the data either by typing the cell range or by dragging the mouse over the cells containing the data. For now, skip the box asking for the bin range (see below for how to use the bin range input). If you have included the row that has the variable name or heading, click in the Labels box. In the box asking for the output range, type or click on the cell reference where you want the output to begin. Do not mark the box next to "Pareto." If you want Excel to draw the histogram, click in the appropriate box. The “Cumulative Percentage” box will give you the ogive. Then click OK.

• The result of this procedure will be a frequency distribution. The first column will show the value that defines the right (or maximum) end of the class interval, which Excel refers to as a “bin.” The second column will show the number of observations in the bin, and the third column will contain the cumulative percentage of observations falling in or below the bin.
• The way to interpret the frequency distribution is as follows. The first frequency number is the number of data points that have values less than or equal to the first bin number. The next frequency number is the number of data points less than or equal to the second bin number, but greater than the first bin number. For example, in what is above, there is one number in the data set that is less than or equal to 16.9. There are 2 numbers in the data set larger than 16.9 and less than or equal to 18.84. The other numbers are interpreted similarly. The last bin always says “More.” The corresponding frequency number tells us how many numbers in the data set are larger than the second to last bin number. In the example above, 1 number in the data set is larger than 44.06.

Minor Fixes to Excel’s Output

• There are two things about Excel’s histogram output that I don’t like. The first is the way it handles the first bin. It always sets the first bin value equal to the smallest number in the data set. Hence its frequency is almost always equal to 1. In almost every case, I choose to combine this bin with the next one. To do so, I add the frequency of this first bin to the frequency of the second bin, and then delete the first row of the output given by Excel. For the example above, the first two rows of my modified frequency distribution would look like this.

Bin      Frequency   Cumulative %
18.84    3           1.25%
20.78    13          6.67%
• The second thing that I don’t like is that the chart that Excel automatically constructs is actually a bar graph. To make it look more like a histogram, we need to have no space between the bars. To remove the space, double click on the bars of the chart, then select the Options tab, and change the Gap width to 0, then select OK.

Selecting Your Own Bin Values

• If you don’t like the bin values that Excel uses, you can create your own. Below I describe the process that I would follow to do it. As you can see, it is quite a bit longer, and my preference is to let Excel choose the bin values.

1. First determine the number of bins. Say that the number of observations you have is n. Then a rule for the number of bins is (2*n)^(1/3) (i.e., the cube root of 2n). You will usually have to round this number to an integer. The usual suggestion is to round up. For the example above, there were 240 data points. Then (2*240)^(1/3) = 7.83. We round up to 8 to get 8 bins.
2. To find the bin width, take the range of the data (largest minus smallest), and divide by the number of bins found in step 1 above. Again you will want to round up to determine the actual bin width, but it is quite subjective as to how to round (you can go to the nearest integer, tenth, hundredth, etc.). For the example, the smallest and largest of the 240 values were 16.9 and 46. To find the interval, we use (46 - 16.9)/8 = 3.64. The original data had two decimal places, so it is convenient to use two decimal places for the bin width. To make it an “even” number, I decided to use 3.65 as the bin width.
3. When creating the bin boundaries, I take the smallest number and add the bin width to it to obtain the starting bin value. If you don’t like fractions or “uneven” numbers, you can round to a neighbor that fits your criteria for a good starting value. Excel will take the first number that you put in the bin range, and then find how many numbers in the data set are less than or equal to that number. Then it will take the 2nd number in the bin range, and find how many are greater than the first bin number, but less than or equal to the second bin number.

For the example, say my original data are in cells A2:A241 and cell A1 contains a label. In cells C2:C8 I can enter the numbers 20.5 (which is close to 16.9 + 3.65), 24.15, 27.8, 31.45, 35.1, 38.75, 42.4. (Notice I only entered 7 numbers, even though there are 8 bins; the 8th bin will be created by Excel and called “More.”) In cell C1 I should enter some label for the bins. The most obvious choice is to just type “Bin” in C1. (If you check the “Labels in First Row” box, you must add a label to the bins as well.) Now use Data Analysis from the Tools menu. Input A1:A241 in the data input range. In the bin input range, enter C1:C8. Choose the other options as normal. Then hit OK.

• Below is the resulting output, including the chart (after adjusting the gap width to 0).

Bin      Frequency   Cumulative %
20.5     13          5.42%
24.15    88          42.08%
27.8     93          80.83%
31.45    28          92.50%
35.1     13          97.92%
38.75    2           98.75%
42.4     2           99.58%
More     1           100.00%
• The interpretation of the frequency distribution is exactly the same as before.

[Chart: “Histogram.” Frequency (axis from 0 to 100) plotted against Bin (20.5, 24.15, 27.8, 31.45, 35.1, 38.75, 42.4, More), with a second axis for the cumulative percentage running from .00% to 120.00%.]
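The bin rule and the counting convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `suggest_bin_count` and `bin_frequencies` are invented for this sketch, and the short data list is hypothetical, because the 240 original values are not reproduced in these notes. Each bin value is treated as an upper limit that includes its own value, and anything larger than the last bin value is counted under “More.”

```python
import math

def suggest_bin_count(n):
    """Rule from the notes: cube root of 2n, rounded up."""
    return math.ceil((2 * n) ** (1 / 3))

def bin_frequencies(data, bins):
    """Count values the way the Histogram tool is described in the notes:
    a bin holds values greater than the previous bin value and less than
    or equal to its own value; the final row is 'More'."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)    # last slot is the "More" row
    for x in data:
        for i, upper in enumerate(bins):
            if x <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1           # larger than every bin value
    labels = [str(b) for b in bins] + ["More"]
    rows, running = [], 0
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, 100.0 * running / len(data)))
    return rows

# Numbers from the example in the notes: n = 240, smallest 16.9, largest 46.
n_bins = suggest_bin_count(240)       # cube root of 480 is about 7.83
raw_width = (46 - 16.9) / n_bins      # about 3.64 before rounding to 3.65
bin_values = [round(20.5 + 3.65 * k, 2) for k in range(n_bins - 1)]

# Hypothetical data, used only to show the counting convention.
sample = [16.9, 20.5, 20.6, 24.0, 30.0, 47.0]
table = bin_frequencies(sample, bin_values)
for label, count, cumulative in table:
    print(f"{label:<6} {count:>3} {cumulative:7.2f}%")
```

Only seven bin values are generated for eight bins, matching the point made above that the last bin is supplied automatically as “More.”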
Descriptive Statistics

• To use Excel to obtain a listing of descriptive statistics, we again use the Analysis Tools. This time, instead of selecting "Histogram," select "Descriptive Statistics." Indicate where the data are located, and select whether they are in rows or columns. If you want the data set to have a descriptive title, you can include the label in the first entry above the data, and then click the box next to "Labels." Specify where you want the output to go. I recommend always clicking on the “Summary Statistics” box. I also recommend checking the Confidence Level for Mean box (and filling in the confidence level) if you are interested in confidence intervals for the mean. I rarely use the other boxes.

• You can do descriptive statistics on several variables at once. You just need to be sure that the variables are next to each other in the spreadsheet, and then refer to all the columns in the input portion of the dialog box.

• Here is some example output.

Height
Mean                       1.2
Standard Error             0.01206
Median                     1.18
Mode                       1.23
Standard Deviation         0.04
Sample Variance            0.0016
Kurtosis                   -1.11302
Skewness                   0.50417
Range                      0.12
Minimum                    1.15
Maximum                    1.27
Sum                        13.2
Count                      11
Confidence Level(99.0%)    0.03822
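Several rows of this output are tied together by simple formulas, and a short Python sketch can make those links explicit. It uses only the numbers printed above. The one assumption is the interpretation of the last row: it is taken to be the t multiplier times the standard error, with 10 degrees of freedom (Count minus 1).

```python
import math

# Values copied from the example output above.
mean = 1.2
std_dev = 0.04
count = 11
margin = 0.03822            # the row Excel labels "Confidence Level(99.0%)"

# Standard error of the mean is s / sqrt(n).
std_error = std_dev / math.sqrt(count)

# Sample variance is the square of the sample standard deviation.
sample_variance = std_dev ** 2

# Assumption: margin = t * standard error, so dividing recovers the
# t multiplier (two-sided 99%, count - 1 = 10 degrees of freedom).
implied_t = margin / std_error

# The 99% confidence interval for the mean is mean +/- margin.
lower, upper = mean - margin, mean + margin

print(f"standard error  {std_error:.5f}")
print(f"sample variance {sample_variance:.4f}")
print(f"implied t       {implied_t:.3f}")
print(f"99% interval    ({lower:.5f}, {upper:.5f})")
```

The implied multiplier comes out close to 3.169, the usual table value for a two-sided 99% interval with 10 degrees of freedom.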
Box plots

• There is nothing built in to Excel to do box plots. I have created a template that will do up to 4 simultaneous box plots. It is also limited to data sets of no more than 500 observations. It has some faults, but it is not bad. The file is called Multiple Boxplots.xls. Below is a sample of what it produces.

[Figure: sample box plots for two groups, Automobile and Public, on a common axis from 0 to 50.]

Covariance

• A covariance matrix can be obtained from the spreadsheet by using the Covariance Analysis Tool. Select Data Analysis, then Covariance. Identify the input area, and where you would like the output to go. Indicate whether the data are grouped by column or row, and whether labels are being used, and then select OK. Example output is shown below.

**WARNING**
This analysis tool divides the cross products by n rather than by n-1. If you want true sample variances and covariances, you should multiply all of the numbers by n/(n-1).
              Day        Hour       Prep Time   Wait Time   Travel Time   Distance
Day           3.906276
Hour          0          5.271967
Prep Time     0.212155   -0.19626   1.110149
Wait Time     1.123469   -0.83906   0.310482    10.60447
Travel Time   0.193243   0.221318   0.033146    -0.29553    3.59392
Distance      0.166276   0.143933   -0.02226    -0.06857    1.799046      1.02825

• The numbers on the diagonals are variances (except they are divided by n), and all other numbers are covariances. The matrix is symmetric, so only numbers on one side of the diagonal are shown.

Correlation

• We can also use the spreadsheet to find the sample correlation matrix, and the procedure is identical to that of finding the covariance, except that we choose the Correlation Analysis Tool.

• Here is the correlation matrix for the pizza example.

              Day       Hour      Prep Time   Wait Time   Travel Time   Distance
Day           1.0000
Hour          0.0000    1.0000
Prep Time     0.1019    -0.0811   1.0000
Wait Time     0.1746    -0.1122   0.0905      1.0000
Travel Time   0.0516    0.0508    0.0166      -0.0479     1.0000
Distance      0.0830    0.0618    -0.0208     -0.0208     0.9359        1.0000

• The off-diagonal terms are the sample correlation coefficients between pairs of variables. Excel does these computations correctly and no adjustments are necessary.
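The n versus n-1 adjustment, and the reason the correlations need no adjustment, can be illustrated with a short Python sketch. The helper name `sample_cov_from_excel` and the five-point data list are hypothetical and used only for illustration; the variance and covariance entries are copied from the covariance matrix above.

```python
import math
import statistics

def sample_cov_from_excel(value, n):
    """Rescale a number from the Covariance tool (divisor n) to divisor n - 1."""
    return value * n / (n - 1)

# Hypothetical data, used only to show the two divisors side by side.
x = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(x)
population_var = statistics.pvariance(x)   # divides by n, like the tool
sample_var = statistics.variance(x)        # divides by n - 1
rescaled = sample_cov_from_excel(population_var, n)

# Entries copied from the covariance matrix above.
var_travel = 3.59392
var_distance = 1.02825
cov_travel_distance = 1.799046

# Correlation = covariance / (sd_1 * sd_2). The same divisor appears in the
# numerator and the denominator, so the n versus n - 1 choice cancels.
corr = cov_travel_distance / math.sqrt(var_travel * var_distance)

print(f"rescaled variance {rescaled:.4f}, sample variance {sample_var:.4f}")
print(f"Travel Time / Distance correlation {corr:.4f}")
```

The computed correlation agrees with the Travel Time and Distance entry in the correlation matrix, which is consistent with the statement that the Correlation tool needs no adjustment.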
Summarizing Qualitative Data in Tables

• Excel has a utility called a Pivot Table that allows us to create and analyze tabular summaries (contingency tables) of qualitative data. It can also be used with quantitative data or combinations of quantitative and qualitative data.

• To use the pivot table feature, data must be entered in columns and each column must have a title or header. Before invoking the procedure, be sure that the cursor is in one of the cells containing a header or data.

• To start the “wizard,” go to Data/PivotTable and PivotChart Report. In the first step, just click on Next (the default values are what we want). In the second step, verify that the data range shown contains all of the data that you want to analyze, then click on Next again.

• In step 3, click on the button called “Layout.” You will be presented with the following dialog box (except the buttons on the right will change according to the data set you are using).
• At this point, click on and drag the button corresponding to the variable that you want to be on the rows of your output table to the area labeled “Row” and the variable you want in columns to the area that says “Column.” Then drag either of the two buttons that you just used to the “Data” area. I recommend always dragging one of the qualitative variables’ buttons. The button should change to say “Count of VARIABLE,” where VARIABLE is the name of the variable that you dragged to the middle. Then click OK.

• To complete the procedure there are a few other options you can change if you desire, but I usually just click on Finish at this point and change options later if the output is not what I desire. If you have used a quantitative variable, you will likely want to group it. To do so, right click on the variable name in the table. One item in the pop-up menu should say Group. Choose it, and then specify how you want the variable to be grouped.

• The pivot table can display several different types of summary measures. The default or “normal” state is to display total counts. There may be times that you want to display the numbers in the table as overall percentages, as row percentages, etc. To change the display, click anywhere in the table and go again to the Data/PivotTable and PivotChart Report menu item. You should be at step 3 again. Click on Layout and then double click what is in the middle of the table (it should say “Count of…”). Then select Options. A drop down menu that says “Show Data As” will be in the middle of the dialog box. Use the drop down menu to say how you want to display the data. Then exit out of all of the boxes.

• The default way that Excel lists the categories in qualitative variables is alphabetically. You may want them listed in some kind of logical ascending order (for example, you may want to list class standing as Freshman, Sophomore, Junior and Senior). To tell Excel how you want the labels to be ordered, go to the Tools menu, select Options, and then click on the tab called “Custom Lists.” Then you can type in the list items in the order you want them (separate them with a comma or return) in the List Entries section. Or you can import the list in the order that you want by identifying the cells where they are listed.
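For readers who want to see what the pivot table is computing, here is a minimal Python sketch of the same idea: a table of counts for two qualitative variables, listed in a custom order rather than alphabetically, and then shown as percentages of the overall total. Every record below is hypothetical, and the two ordering lists simply play the role that Custom Lists play in Excel.

```python
from collections import Counter

# Hypothetical records: (class standing, rating). All values are made up.
records = [
    ("Freshman", "Good"), ("Senior", "Excellent"), ("Junior", "Good"),
    ("Freshman", "Very Good"), ("Sophomore", "Good"), ("Senior", "Good"),
    ("Junior", "Excellent"), ("Freshman", "Good"),
]

# Custom orderings, standing in for Excel's Custom Lists.
row_order = ["Freshman", "Sophomore", "Junior", "Senior"]
col_order = ["Good", "Very Good", "Excellent"]

counts = Counter(records)        # one count per (row, column) pair
total = sum(counts.values())

# Table of raw counts, then the same table as percentages of the total.
count_table = [[counts[(r, c)] for c in col_order] for r in row_order]
percent_table = [[100.0 * v / total for v in row] for row in count_table]

print(f"{'':<10}", *[f"{c:>10}" for c in col_order])
for label, row in zip(row_order, count_table):
    print(f"{label:<10}", *[f"{v:>10}" for v in row])
```

A pair that never occurs in the data shows up as a count of 0 here, because a `Counter` returns 0 for a missing key.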
• Below is a portion of an Excel worksheet with both qualitative and quantitative variables. It shows both a portion of the original data and the resulting pivot table. I created a custom list in Excel as “Good, Very Good, Excellent.”

Random Sampling

• We can obtain a random sample from a set of data using the analysis tools. The tool is called "Sampling." Before using the tool, I suggest including a column in the data file that is a numbered label. After selecting Data Analysis, choose the Sampling tool. Next indicate the location of the numbers to be sampled from (which would be the location of the data labels), input the first cell of the output block, choose random (rather than periodic), then indicate how many samples you want to draw (i.e., the sample size). Then hit OK.

• With the above procedure, it is possible to obtain repeated items in the sample (e.g., the same item could be drawn twice). That is why I use the label column rather than the original data column to create the sample. That way I can tell if I have duplicates. If I do obtain a duplicate, I simply continue to draw more samples until I have a sufficient number of distinct items for the desired sample size.
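The idea of sampling labels with replacement and then continuing until enough distinct items are in hand can be sketched in Python as follows. This is an illustration of the logic only, not a description of how Excel implements the Sampling tool. The function name `sample_distinct_labels`, the 300 labels, and the made-up measurements are all assumptions for the sketch.

```python
import random

def sample_distinct_labels(labels, size, rng):
    """Draw labels one at a time with replacement and keep drawing until
    `size` distinct labels have been collected."""
    if size > len(set(labels)):
        raise ValueError("sample size exceeds the number of distinct labels")
    chosen, seen = [], set()
    while len(chosen) < size:
        label = rng.choice(labels)   # may repeat an earlier draw
        if label not in seen:        # a duplicate is skipped, not kept
            seen.add(label)
            chosen.append(label)
    return chosen

# Hypothetical data: labels 1..300 paired with made-up measurements.
rng = random.Random(0)               # fixed seed so the sketch is repeatable
labels = list(range(1, 301))
values = {label: round(rng.uniform(10, 50), 2) for label in labels}

sampled_labels = sample_distinct_labels(labels, 25, rng)

# Going from a sampled label back to its data value is a simple lookup.
sampled_values = [values[label] for label in sampled_labels]
print(sorted(sampled_labels))
```

Sampling the labels rather than the data values is what makes duplicates detectable: two equal data values could be different items, but two equal labels are always the same item.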
• The best way I know to look for a duplicate is to sort the data. The sort routine is under the DATA menu or can be found on the toolbar.

• To find the actual data associated with the label, we can use the function =VLOOKUP. Suppose that my labels are in cells A2:A301 and the data from which I want the random sample is in cells B2:B301. Suppose also that I started the output from the Sampling tool in C2 and drew a sample of 25 (so the sampled labels are in cells C2:C26). I will also assume that I don’t have any duplicates. Then in cell D2 I would enter the function =VLOOKUP(C2,$A$2:$B$301,2). This function says to look for what is in cell C2 in the first column of A2:B301. When it finds the number, it reports back the corresponding number in the second column of A2:B301 (the 2 is what tells it to report back what is in the second column). Because the optional fourth argument is omitted, the labels in column A must be in ascending order, which a numbered label column already is. Then I would copy cell D2’s contents down through cell D26.

Inference Tools

• The majority of the tools in Excel are for statistical inference. I will discuss how to use the tools for confidence intervals on one mean, hypothesis tests on one and two means, analysis of variance, and regression.

Confidence Intervals

• I have already described the Descriptive Statistics Tool, which is what we use to do confidence intervals. The tool is useful for cases where we have the data and we do not know the population standard deviation. Then we use the first and last numbers in the Descriptive Statistics output to create the confidence interval. The first number is the sample average. The last number, which Excel calls Confidence Level(xx%) (which I consider to be a very poor name), is the margin of error. The interval is the sample average plus or minus the margin of error. Below I have repeated part of the printout from above.
Height
Mean                       1.2
Standard Error             0.01206
...                        ...
Count                      11
Confidence Level(99.0%)    0.03822

Hypothesis Test on One Mean:

This procedure is used when you do not know the population standard deviation and you have all of the data given. Before going to the Tools menu you need to add another column which consists only of the hypothesized value µ0, next to each value of the original data. The easiest way to do this is to enter µ0 once, and then use the fill down command to put it in the rest of the cells.

Then from the Data Analysis Tools select "t-Test: Paired Two-Sample for Means" in Excel. Variable 1 input will be the column where the original data are located. Variable 2 input will be the column where the hypothesized value is located. Indicate where you want the output to go, and give a level of significance (α) value. The (hypothesized) difference should always be 0 or can be left blank. Finally, if you labeled your columns and included them in the Variable 1 and Variable 2 input portions, then click the labels box.

Example:

Pineapple Corporation (PC) maintains that their cans have always contained an average of 12 ounces of fruit. The production group believes that the mean weight has changed. The drained weights in ounces for a sample of 15 cans of fruit from PC had a mean value of 12.09 and a standard deviation of .20. Use an appropriate hypothesis test to determine if the data show evidence of a change in mean weight. Use a significance level of .01. The output is presented on the next page.
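Since the example reports only summary statistics, the test statistic can also be worked out directly. The Python sketch below does that arithmetic for a two-sided test. The critical value 2.977 is an assumption in the sense that it is taken from a standard t table (upper tail area 0.005, 14 degrees of freedom) rather than computed in the code.

```python
import math

# Summary numbers from the Pineapple Corporation example.
n = 15
sample_mean = 12.09
sample_sd = 0.20
mu_0 = 12.0                 # hypothesized mean
alpha = 0.01

# Test statistic for H0: mu = mu_0 against a two-sided alternative.
std_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu_0) / std_error
df = n - 1

# Assumption: two-sided critical value for alpha = 0.01 and df = 14,
# looked up in a standard t table instead of being computed here.
t_crit = 2.977

reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f} with {df} df, reject H0: {reject_h0}")
```

With these numbers the statistic is well inside the critical value, so at the .01 level the sketch does not reject the claim of a 12 ounce mean.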
Testing Two Means (with unpaired or unmatched samples)

If we want to test the relationship between two means, we have two choices: "t-Test: Two-Sample Assuming Equal Variance" or "t-Test: Two-Sample Assuming Unequal Variance." The choice obviously depends on what we believe the relationship is between the population variances of the two groups. Whatever we decide, the procedure in Excel is identical once we have made our choice. Variable 1 input will be the column (or row) where the first set of data is located. Variable 2 input will be the column (or row) where the second set of data is located. Indicate where you want the output to go, and give a level of significance (α) value. The hypothesized difference will usually be 0, but not always. Finally, if you labeled your columns (or rows) and included them in the Variable 1 and Variable 2 input portions, then click the labels box.

• Consider the following example. A manager is interested in determining whether the productivity of workers that work during two different shifts is the same. To test her hypothesis, the manager randomly samples 8 workers from each shift and records the average time (in minutes) needed to complete a given assembly-line task, with the results given below.

Shift 1   81.2   72.6   56.8   76.9   42.5   49.6   62.8   48.2
Shift 2   56.6   58.6   45.4   39.1   42.8   65.2   40.7   49.9

From the data, can we conclude that the two shifts have the same productivity level? It looks like the second shift completes the task in less time, but is the difference due to sampling, or because the mean times are really different? The output from the two procedures is given on the next page.
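As a cross-check on whatever Excel reports, the two test statistics can be computed directly from the shift data. The Python sketch below uses the textbook pooled-variance formula for the equal-variance case and the separate-variance formula with approximate (Satterthwaite) degrees of freedom for the unequal-variance case. It is an assumption that these match Excel's internal calculations exactly. P-values are left out because they need a t distribution function that the standard library does not provide.

```python
import math
import statistics

shift_1 = [81.2, 72.6, 56.8, 76.9, 42.5, 49.6, 62.8, 48.2]
shift_2 = [56.6, 58.6, 45.4, 39.1, 42.8, 65.2, 40.7, 49.9]

n1, n2 = len(shift_1), len(shift_2)
mean_1, mean_2 = statistics.mean(shift_1), statistics.mean(shift_2)
var_1, var_2 = statistics.variance(shift_1), statistics.variance(shift_2)

# Equal-variance version: pool the two sample variances.
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
t_equal = (mean_1 - mean_2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_equal = n1 + n2 - 2

# Unequal-variance version: separate variances, approximate df.
se_sq = var_1 / n1 + var_2 / n2
t_unequal = (mean_1 - mean_2) / math.sqrt(se_sq)
df_unequal = se_sq ** 2 / (
    (var_1 / n1) ** 2 / (n1 - 1) + (var_2 / n2) ** 2 / (n2 - 1)
)

print(f"means: {mean_1:.4f} and {mean_2:.4f}")
print(f"equal variances:   t = {t_equal:.3f}, df = {df_equal}")
print(f"unequal variances: t = {t_unequal:.3f}, df = {df_unequal:.2f}")
```

Because both samples have 8 observations, the two versions give the same t statistic here; only the degrees of freedom differ.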
Testing Two Variances

• To do this type of problem on the computer, go to the Data Analysis Tools, and select "F-test Two-Sample for Variances." Variable 1 input will be the column (or row) where the first set of data is located. Try to use the variable with the largest sample variance as variable 1. Variable 2 input will be the column (or row) where the second set of data is located. Give the value of α and indicate where you want the output to go. Finally, if you labeled your columns (or rows) and included them in the Variable 1 and Variable 2 input portions, then click the labels box.

• For this procedure, Excel only calculates one-sided values. If the test is two-sided (as it usually is) you have two options. First, you can divide the given value of α by 2, and input the result as the level of significance. The second option is to always use the p-value criterion and for a two-sided test, multiply the one-sided p-value by 2.

• For the example:

F-Test: Two-Sample for Variances

                      Shift 1   Shift 2
Mean                  61.325    49.7875
Variance              207.356   89.501
Observations          8         8
df                    7         7
F                     2.317
P(F<=f) one-tail      0.145
F Critical one-tail   3.787
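The F statistic in this output is just the ratio of the two sample variances, which a few lines of Python can confirm. The data are the two shifts from the example earlier in the notes. The one-tail p-value is copied from the output above rather than computed, and doubling it is the second of the two options described for a two-sided test.

```python
import statistics

shift_1 = [81.2, 72.6, 56.8, 76.9, 42.5, 49.6, 62.8, 48.2]
shift_2 = [56.6, 58.6, 45.4, 39.1, 42.8, 65.2, 40.7, 49.9]

var_1 = statistics.variance(shift_1)   # sample variance, divisor n - 1
var_2 = statistics.variance(shift_2)

# Larger sample variance goes in the numerator, as suggested above.
f_stat = max(var_1, var_2) / min(var_1, var_2)
df_num = len(shift_1) - 1
df_den = len(shift_2) - 1

# One-tail p-value copied from the Excel output, doubled for a
# two-sided test (capped at 1).
p_one_tail = 0.145
p_two_sided = min(1.0, 2 * p_one_tail)

print(f"variances {var_1:.3f} and {var_2:.3f}")
print(f"F = {f_stat:.3f} on ({df_num}, {df_den}) df")
print(f"two-sided p-value is about {p_two_sided:.2f}")
```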
ANOVA:

• Excel can do one-way and two-way analysis of variance. I only describe the single factor case below. If you are interested in two-way ANOVA, Excel’s help should guide you through it. It should also be very similar to what is described below.

• After selecting Data Analysis, choose the option called "Anova: Single Factor" in Excel. Next specify the input block, which will contain the data from all groups. Each group should be in its own column or row. If the groups have differing numbers of samples, be sure that the highlighted range includes all samples. Excel will handle the blank spaces without a problem. Indicate where to send the output, and then input a value of α. Check the box indicating whether the groups are entered in columns or rows, and check the label box if you have included labels in your input block. Then start the procedure.

• Example

Three different automatic milling machines at Castmetal, Inc. were set up to mill the same type of part. Observations were taken at random times to find out how many parts were being produced per hour by each machine. Only four observations were taken on machine 3 since the inspector became ill and had to go home before he could complete his work. These data were entered into Excel in cells A1:C5. Can we conclude that the mean hourly output for the three machines is different?

Machine 1   Machine 2   Machine 3
105         91          104
105         99          106
110         89          99
107         95          109
102         103
• Below is the dialog box and output for the example.

Anova: Single-Factor

Summary
Groups      Count   Sum   Average   Variance
Machine 1   5       529   105.8     8.7
Machine 2   5       477   95.4      32.8
Machine 3   4       418   104.5     17.6667

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        313.8571   2    156.9286   7.882257   0.007518   7.205699
Within Groups         219        11   19.90909
Total                 532.8571   13

• Conclusions:
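The entries in the ANOVA table can be reproduced from the raw data, which is a useful check that the table is being read correctly. The Python sketch below uses the standard single-factor sums of squares. The three groups are the milling machine data from the example, with the short fifth row taken to belong to Machines 1 and 2 (this matches the counts and sums in the Summary table). The critical value is copied from the output rather than computed.

```python
import statistics

groups = {
    "Machine 1": [105, 105, 110, 107, 102],
    "Machine 2": [91, 99, 89, 95, 103],
    "Machine 3": [104, 106, 99, 109],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.mean(all_values)
n_total = len(all_values)
k = len(groups)

# Between groups: group size times the squared distance of the group
# mean from the grand mean.
ss_between = sum(
    len(v) * (statistics.mean(v) - grand_mean) ** 2 for v in groups.values()
)
# Within groups: squared deviations from each group's own mean.
ss_within = sum(
    sum((x - statistics.mean(v)) ** 2 for x in v) for v in groups.values()
)

df_between, df_within = k - 1, n_total - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

f_crit = 7.205699            # "F crit" copied from the Excel output
reject_h0 = f_stat > f_crit

print(f"SS between {ss_between:.4f}, SS within {ss_within:.4f}")
print(f"F = {f_stat:.4f}, exceeds F crit: {reject_h0}")
```

Comparing F with F crit in this way is equivalent to comparing the P-value in the table with the chosen α.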
Regression:

• Doing regression in Excel is very similar to using the other analysis tools. With regression, however, having the data in the right form is more important. First, all data should be entered in columns. Second, all independent variables should be next to each other (i.e., in a contiguous set of cells).

Once the data are entered correctly, select "Regression" from the Tools/Data Analysis menu item in Excel. You will be presented with the dialog box shown below.

In the Input Y Range, enter the cell range referring to the column containing the dependent variable. In the Input X Range, enter the range of cells containing all independent variables. This is why the X variables need to be next to each other. If your range of cells included a row of labels, click the label box.
I never click the Constant is Zero box. In some physical systems it only makes sense for the intercept to be 0, so we can force it to be so. In our examples that will never be the case. If you want a confidence interval for the β values other than a 95% confidence interval, click in the Confidence Level box and enter a different confidence level.

Next, indicate where you want the output to go. Finally, click on the box next to “Residuals.” I leave all other boxes blank, because I don’t like the way that Excel does the rest of the residual analysis or the normal probability plot. Then hit Enter.

• Below is some sample output.

Regression Statistics
Multiple R          0.936248
R Square            0.87656
Adjusted R Square   0.874991
Standard Error      0.670277
Observations        240

Analysis of Variance
             df    Sum of Squares   Mean Square   F          Significance F
Regression   3     752.919          250.973       558.6225   7.1E-107
Residual     236   106.028          0.449271
Total        239   858.947

            Coefficients   Standard Error   t Statistic   P-value    Lower 95%   Upper 95%
Intercept   1.156832       0.229887         5.032169      9.54E-07   0.703939    1.609725
Day         -0.02521       0.022013         -1.14541      0.253183   -0.06858    0.018153
Hour        -0.00592       0.018919         -0.31297      0.754578   -0.04319    0.031351
Distance    1.754525       0.042988         40.8147       1E-109     1.669837    1.839213
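Most of the summary numbers in this output can be rebuilt from the Analysis of Variance table and the coefficient table, which helps in reading the printout. The Python sketch below does only that arithmetic, using values copied from the sample output above. It does not fit a regression, since the underlying data are not reproduced in these notes. Small differences in the last digits are expected because the printed inputs are rounded.

```python
import math

# Values copied from the sample output above.
n_obs = 240
df_regression, df_residual = 3, 236
ss_regression, ss_residual, ss_total = 752.919, 106.028, 858.947

# R Square is the share of the total sum of squares that is explained.
r_square = ss_regression / ss_total

# Adjusted R Square compares mean squares instead of sums of squares.
adj_r_square = 1 - (ss_residual / df_residual) / (ss_total / (n_obs - 1))

# F is the ratio of the two mean squares.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# "Standard Error" in the top block is the square root of the residual MS.
std_error = math.sqrt(ms_residual)

# Each t statistic is a coefficient divided by its standard error.
coefficients = {
    "Intercept": (1.156832, 0.229887),
    "Day": (-0.02521, 0.022013),
    "Hour": (-0.00592, 0.018919),
    "Distance": (1.754525, 0.042988),
}
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

print(f"R Square {r_square:.5f}, Adjusted R Square {adj_r_square:.6f}")
print(f"F {f_stat:.3f}, Standard Error {std_error:.6f}")
for name, t in t_stats.items():
    print(f"{name:<10} t = {t:.4f}")
```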