Statistics and Displays: A Basic Tutorial
This tutorial is designed as a refresher study for teachers but may be adapted for use with students.
Tutorial Contents
This tutorial consists of vocabulary; data displays and their properties by grade level. The user of this tutorial may skip to desired sections and pages via embedded links. Links are indicated by black font and underlined text.
Vocabulary
  Outliers
  Measures of Central Tendency
  Measures of Spread
  Skewness
  Types of data
  Types of variables
Outliers
An outlier is a data point that lies outside the overall pattern of a distribution. Specifically, it is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile. An outlier may exist in both uni-variate data and bi-variate data.
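The 1.5 times interquartile range rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `quartiles` and `find_outliers` are hypothetical, the quartiles are found as the medians of the lower and upper halves of the ordered data (the method used in the box and whisker examples of this tutorial), and the extra score of 20 is invented purely to trigger the rule. The other scores are the math test scores from the box and whisker example.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


def find_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


scores = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]
print(find_outliers(scores))         # no score is flagged
print(find_outliers(scores + [20]))  # the hypothetical score of 20 is flagged
```

For the original ten scores the quartiles are 70 and 90, so only values below 40 or above 120 would count as outliers.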
Measures of Central Tendency
  Mean
  Median
  Mode
All of these measures can describe the “average” of a data set, so the term “average” should not be treated as synonymous with the term “mean.”
Mean (Arithmetic Mean)
The mean is the sum of all the values in the data set divided by the number of data points in the set. The mean is a good measure for roughly symmetric sets of data. It may be misleading in skewed sets of data, as it is influenced by extreme values.
$\text{mean} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Median
The median is the middle term in an ordered list of data points. When the list has an even number of data points, the median is the mean of the two middle terms. It is the middle of a distribution of data values: half the scores lie on one side, and half lie on the other side. The median is less sensitive to extreme scores than the mean. It is a good measure to use when describing a set with extreme outlier values.
Mode
The mode is the value that appears most frequently in the data set. More than one mode can exist when two (or more) values appear equally often. Bi-modal, tri-modal, etc. can be used to describe the number of modes in a data set when there is more than one. Mode is the ONLY measure of central tendency that can be used with nominal data. The mode fluctuates greatly with changes in a sample, and is not recommended as the only measure of central tendency to describe a data set.
Interesting Notes about Measures of Central Tendency
In a normal distribution of scores, the mean, median and mode all have the same value. Mean is the most efficient measure to use in a normal distribution. Median is usually the best to use when the distribution is skewed with outliers.
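As a quick check of the three measures, the sketch below applies Python's standard `statistics` module to the math test scores used in the box and whisker example of this tutorial. It is an illustration only; the variable name `scores` is an assumption, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Math test scores from the box and whisker example.
scores = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]

print("mean:  ", mean(scores))       # sum of the values divided by their count
print("median:", median(scores))     # middle of the ordered list
print("modes: ", multimode(scores))  # every value tied for most frequent
```

For these scores the mean is 80.5, the median is 80, and the set is bi-modal because 65 and 80 each appear twice.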
Measures of Spread
  Range
  Variation
  Standard Deviation
Range
The range is the difference between the highest and lowest values in the data set. The range is highly sensitive to extreme scores (outliers) and, thus, is not good to use as the only measure of spread.
Variation
Variation (more commonly called variance) is a measure of how spread out a distribution is. It is computed as the mean of the squared differences of each value from the mean of the set, so it describes how far, on average, the values lie from the mean (in squared units), not how far each value is from the next value in the set.
$\text{variation} = \dfrac{(\bar{X} - x_1)^2 + (\bar{X} - x_2)^2 + \cdots + (\bar{X} - x_n)^2}{n}$, where $\bar{X}$ is the mean of the set.
Standard Deviation
The standard deviation is computed as the square root of the variance. It is one of the most commonly used measures of spread for a data set because it takes into account all the data points rather than just the extreme ends. It is often used as a measure of risk in real-world applications such as stock investments. The standard deviation is very useful when working with a normal distribution.
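The two definitions can be checked numerically. The sketch below is a minimal illustration using the math test scores from the box and whisker example of this tutorial; the variable names are assumptions. It uses `pvariance` and `pstdev` from Python's `statistics` module because they divide by n, as the tutorial's definition does (`variance` and `stdev` divide by n - 1 instead).

```python
from statistics import mean, pstdev, pvariance

scores = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]

m = mean(scores)

# Variance "by hand": the mean of the squared differences from the mean.
variance_by_hand = sum((m - x) ** 2 for x in scores) / len(scores)

variance = pvariance(scores)  # same quantity from the standard library
std_dev = pstdev(scores)      # square root of the variance

print(m, variance_by_hand, variance, std_dev)
```

For these scores the variance is 132.25 and the standard deviation is 11.5, so a typical score lies roughly 11.5 points from the mean of 80.5.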
Skewness
A distribution of data is skewed if one of the tails (ends) is longer than the other.
Positive skew – long tail in the positive (right) end. The mean is typically larger than the median.
Negative skew – long tail in the negative (left) end. The mean is typically smaller than the median.
Symmetric distributions look like the normal curve and are symmetrical on both tails. The mean and median are equal.
Types of Data
Uni-variate data: the data is collected on only one variable.
Bi-variate data: the data is collected on two variables and plotted together for investigation.
Types of Variables
A categorical variable has values that are labels for a particular attribute (e.g., ice cream flavors).
  Nominal – categories are in no particular order
  Ordinal – categories are in a particular order
A quantitative variable has values that not only are numerical but also allow descriptions such as mean and range to be meaningful (e.g., test scores).
  A discrete variable has only countable values (e.g., the number of students in a class).
  A continuous variable has numerical values that can be any of the values in a range of numbers (e.g., the speed of a car).
Data Displays and their Properties
  6th Grade
  7th Grade
  8th Grade
6th Grade
  Line plot
  Line graph
  Bar graph
  Stem and leaf
  Circle graph (sketch only)
7th Grade
  Line plot
  Line graph
  Bar graph
  Stem and leaf
  Circle graph
  Venn diagram
8th Grade
  Line plot
  Line graph
  Bar graph
  Stem and leaf
  Circle graph
  Venn diagram
  Box and whisker
  Histogram
  Scatterplot
Line Plot
Consists of
  a horizontal number line of the possible data values;
  one X for each element in the data set placed over the corresponding value on the number line.
Works well when
  the data is quantitative (numerical);
  there is one group of data (uni-variate);
  the data set has fewer than 50 values;
  the range of possible values is not too great.
Line Plot Example
Suppose thirty people live in an apartment building. The ages of the residents are below.
58, 30, 37, 36, 34, 49, 35, 40, 47, 47, 39, 54, 47, 48, 54, 50, 35, 40, 38, 47, 48, 34, 40, 46, 49, 47, 35, 48, 47, 46
The graph is easier to create when the ages are placed in order from smallest to largest, as the values will appear on the number line.
30, 34, 34, 35, 35, 35, 36, 37, 38, 39, 40, 40, 40, 46, 46, 47, 47, 47, 47, 47, 47, 48, 48, 48, 49, 49, 50, 54, 54, 58
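A text version of this line plot can be sketched in Python. This is an illustration, not part of the original tutorial: the plot is rotated so the number line runs down the page, with one X per resident beside each age, and the variable names are assumptions.

```python
from collections import Counter

# Ages of the thirty residents from the example above.
ages = [58, 30, 37, 36, 34, 49, 35, 40, 47, 47, 39, 54, 47, 48, 54, 50,
        35, 40, 38, 47, 48, 34, 40, 46, 49, 47, 35, 48, 47, 46]

counts = Counter(ages)  # how many residents have each age

# Walk the whole number line so that gaps in the data stay visible.
for age in range(min(ages), max(ages) + 1):
    print(f"{age:>3} | {'X' * counts[age]}")
```

The tallest stack of X marks (six, at age 47) is the mode, and the empty rows between 40 and 46 show a gap in the data.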
Advantages of Line Plots The plot shows all the data. Line plots allow several features of the data to become more obvious, including any outliers, data clusters, or gaps. The mode is easily visible. The range can be calculated quite easily from this data display.
Disadvantages of Line Plots A line plot may only be used for quantitative (numerical) data. A line plot is not efficient when the data set is large and/or the range is large.
Questions to Ask Is the data skewed? How do the mean, median, and mode compare to each other? Are there any outliers, data clusters, or gaps in the data?
Line Graph
Consists of
  paired values graphed as points on a plane defined by an x- and y-axis;
  line segments connecting the graphed points (much like a dot-to-dot).
Works well when
  the data is paired (bi-variate);
  the data is continuous.
Line Graph Example
[Line graph: John's Weight in Kilograms; x-axis: Year (1991 to 1995); y-axis: weight in kilograms (65 to 75)]
John weighed 68 kg in 1991, 70 kg in 1992, 74 kg in 1993, 74 kg in 1994, and 73 kg in 1995.
Advantages of Line Graphs A line graph is a way to summarize how two pieces of information are related and how they vary depending on one another.
Disadvantages of Line Graphs Changing the scale of either axis can dramatically change the visual impression of the graph.
Questions to Ask As one variable (displayed on the x-axis) increases, what happens to the other variable (displayed on the y-axis)? What other trends in the data do you notice?
Bar Graph
Consists of
  bars of the same width drawn either horizontally or vertically;
  bars whose length (or height) represents the frequency of each value in a data set.
Works well when
  the data is numerical or categorical;
  the data is discrete;
  the data is collected using a frequency table.
Contrast Bar Graphs with Case-Value Plots
In a case-value plot, the length of the bar drawn for each data element represents the data value. In a bar graph, the length of the bar drawn for each data value represents the frequency of that value.
[Case-value plot: Length of Six Cats; x-axis: Cat (A to F); y-axis: Length in Inches (0 to 30)]
Advantages of Bar Graphs The mode is easily visible. A bar graph can be used with numerical or categorical data.
Disadvantages of Bar Graphs A bar graph shows only the frequencies of the elements of a data set.
Questions to Ask Is the data skewed? What is the mode? What if the data were collected _____ instead of _______? Why do you suppose ______ appears only ____ times in the data set? What other conclusions can you draw about the data?
Stem and Leaf Plot
Consists of
  numbers on the left, called the stems, which are the leading place values of the data values (such as the tens digits);
  numbers on the right, called the leaves, which are the trailing place values of the data values (such as the ones digits), so that each leaf represents one of the data elements.
Works well when
  the data contains more than 25 elements;
  the data is collected in a frequency table;
  the data values span many “tens” of values.
Stem and Leaf Plot Additional Notes
A stem and leaf plot is also called a stem plot. It is usually used for one set of data, but a back-to-back stem and leaf plot can be used to compare two data sets.
Data Set A          Data Set B
Leaf      Stem      Leaf
3 2 0     4         1 5 6 7
The numbers 40, 42, and 43 are from Data Set A. The numbers 41, 45, 46, and 47 are from Data Set B.
Stem and Leaf Plot Example
The number of points scored by the Vikings basketball team this season: 78, 96, 88, 74, 63, 86, 92, 66, 72, 88, 83, 90, 67, 81, 85, 94.
Writing the data in numerical order may help to organize the data, but is NOT a required step.
63, 66, 67, 72, 74, 78, 81, 83, 85, 86, 88, 88, 90, 92, 94, 96
Separate each number into a stem and a leaf. Since these are two-digit numbers, the tens digit is the stem and the units digit is the leaf. The number 63 would be represented as
Stem  Leaf
6     3
Group the numbers with the same stems. List the stems in numerical order. Title the graph.
Points scored by the Vikings
Stem  Leaf
6     3 6 7
7     2 4 8
8     1 3 5 6 8 8
9     0 2 4 6
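The steps in this example can be sketched in Python. This is an illustration under stated assumptions: the function name `stem_and_leaf` is hypothetical, and the split into stem and leaf assumes two-digit whole numbers, as in the Vikings data.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), listing units digits (leaves)."""
    stems = defaultdict(list)
    for value in sorted(values):         # sorting keeps the leaves in order
        stem, leaf = divmod(value, 10)   # 63 -> stem 6, leaf 3
        stems[stem].append(leaf)
    return dict(stems)


points = [78, 96, 88, 74, 63, 86, 92, 66, 72, 88, 83, 90, 67, 81, 85, 94]
plot = stem_and_leaf(points)

print("Points scored by the Vikings")
print("Stem  Leaf")
for stem in sorted(plot):
    print(f"{stem:<5} {' '.join(str(leaf) for leaf in plot[stem])}")
```

Sorting inside the function means the raw scores can be passed in any order, which matches the note that ordering the data first is helpful but not required.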
Advantages of Stem and Leaf Plots It can be used to quickly organize a large list of data values. It is convenient to use in determining median or mode of a data set quickly. Outliers, data clusters, or gaps are easily visible.
Disadvantages of Stem and Leaf Plots A stem and leaf plot is not very informative for a small set of data.
Questions to Ask Is the data skewed? Are there any outliers, data clusters, or gaps? What is the mode? What is the median? How would the median be affected by removing a particular data element? By adding a particular data element? What other conclusions can you draw about the data?
Circle Graph (also called Pie Chart)
Consists of
  a circle divided into sectors (or wedges) that show the percent of the data elements that are categorized similarly.
Works well when
  there is only one set of data (uni-variate);
  the goal is to compare each part to the whole set of data.
Circle Graph Example
Cars in School Parking Lot
Color   Number
White   19
Black   25
Gray    11
Red     18
Blue    7
Other   10
Total   90
A proportion can be used to calculate the angle measure for each sector. Using white as the example, 19 white cars compare to the total of 90 in the same way that 76 degrees compares to the total degrees (360) in a circle.
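The same proportion can be applied to every color in the table. The sketch below is a minimal illustration; the variable names are assumptions, and each angle is computed as count × 360 ÷ total.

```python
# Car counts from the parking lot table above.
cars = {"White": 19, "Black": 25, "Gray": 11, "Red": 18, "Blue": 7, "Other": 10}

total = sum(cars.values())  # 90 cars in all

# count / total = angle / 360, so angle = count * 360 / total
angles = {color: count * 360 / total for color, count in cars.items()}

for color, angle in angles.items():
    print(f"{color:<6} {cars[color]:>3} cars  {angle:6.1f} degrees")
```

White works out to 76 degrees, as in the example, and the six angles together account for the full 360 degrees.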
Advantages of Circle Graphs A circle graph can be used for either numerical or categorical data. A circle graph shows a part to whole relationship.
Disadvantages of Circle Graphs Without technology, a circle graph may be difficult to make. Each percent must be converted to an angle by calculating the fraction of 360 degrees. Then the correct angle must be drawn. A circle graph does not provide information about measures of central tendency or spread.
Questions to Ask How does each part compare to another? Why do you suppose ________ was selected more than _______? What conclusions can you draw about the data?
Venn Diagram
Consists of
  circles containing the value of each set or group;
  overlapping or intersecting circles to illustrate the common elements in groups;
  any nonexamples displayed with a value outside of all circles.
Works well when
  a relationship exists between different groups of things (sets).
Advantages of Venn Diagram A Venn diagram visually illustrates the relationship between different groups of things (sets). It shows the occurrence of sharing of common properties.
Disadvantages of Venn Diagram A Venn diagram is of little use when there are no shared features among sets.
Questions to Ask How many elements are in each set? How many elements are common to set ___ and set ___? How many elements are in set ___ but not in set ___? What conclusions can you draw about the data?
Box and Whisker Plot
Consists of
  the “five-point summary” (the least value, the greatest value, the median, the first quartile, and the third quartile);
  a box drawn to show the interval from the first quartile (25th percentile) to the third quartile (75th percentile) with a line drawn through the box at the median;
  line segments, called the whiskers, connecting the box to the least and greatest values in the data distribution.
Works well when
  there is only one set of data (uni-variate);
  there are many data values.
Box and Whisker Plot Example
Math test scores: 80, 75, 90, 95, 65, 65, 80, 85, 70, 100.
Write the data in numerical order and find the five-point summary.
65, 65, 70, 75, 80, 80, 85, 90, 95, 100
median = 80
first quartile = 70 (median of the lower part)
third quartile = 90 (median of the upper part)
smallest value = 65
largest value = 100
Place a point beneath each of these values on a number line. Draw the box and whiskers and median line.
[Figure: number line from 65 to 100 in steps of 5, shown first with the five points and then with the box, whiskers, and median line]
Box and Whisker Plot Example
The following set of numbers is the number of video games owned by each boy in the club, arranged from least to greatest.
18 27 34 52 54 59 61 68 78 82 85 87 91 93 100
68 is the median. The median is the value exactly in the middle of an ordered set of numbers.
52 is the lower quartile. The lower quartile is the median of the lower half of the values (18, 27, 34, 52, 54, 59, 61).
87 is the upper quartile. The upper quartile is the median of the upper half of the values (78, 82, 85, 87, 91, 93, 100).
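Both box and whisker examples follow the same procedure, which can be sketched in Python. This is an illustration under stated assumptions: the function name `five_point_summary` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves, leaving out the middle value when the count is odd, as the video game example does. Statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median


def five_point_summary(values):
    """Return (least, first quartile, median, third quartile, greatest)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


test_scores = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]
video_games = [18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100]

print(five_point_summary(test_scores))  # least 65, Q1 70, median 80, Q3 90, greatest 100
print(five_point_summary(video_games))  # least 18, Q1 52, median 68, Q3 87, greatest 100
```

The box is drawn from the second value to the fourth, the median line sits at the third, and the whiskers reach out to the first and last values.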
Advantages of Box and Whisker Plots A box-and-whisker plot immediately shows the center, the spread, and the overall range of a distribution. Box plots are useful for comparing data sets, especially when the data sets are large or when they have different numbers of data elements.
Disadvantages of Box and Whisker Plots It shows only certain statistics rather than all the data. Since the data elements are not displayed, it is impossible to determine if there are gaps or clusters in the data.
Questions to Ask Is the data skewed? What is the median? How does the median compare to the mean? What other conclusions can you draw about the data?
Histogram
Consists of
  equal intervals marked on the horizontal axis;
  bars of equal width drawn for each interval, with the height of each bar representing either the number of elements or the percent of elements in that interval. (There is no space between the bars.)
Works well when
  data elements could assume any value in a range;
  there is one set of data (uni-variate);
  the data is collected using a frequency table.
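The tutorial gives no worked histogram, so the sketch below is an illustration only. It reuses the thirty ages from the line plot example and groups them into equal intervals; the interval width of 10 years and the variable names are assumptions chosen for this sketch.

```python
from collections import Counter

# Ages of the thirty residents from the line plot example.
ages = [58, 30, 37, 36, 34, 49, 35, 40, 47, 47, 39, 54, 47, 48, 54, 50,
        35, 40, 38, 47, 48, 34, 40, 46, 49, 47, 35, 48, 47, 46]

width = 10  # assumed interval width: 30-39, 40-49, 50-59

# Map each age to the start of its interval, then count ages per interval.
bins = Counter((age // width) * width for age in ages)

for start in sorted(bins):
    print(f"{start}-{start + width - 1}: {'#' * bins[start]}")
```

The interval from 40 to 49 holds the most ages (16 of the 30). The individual ages can no longer be read from the display, which is why exact measures of central tendency cannot be calculated from a histogram alone.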
Advantages of Histograms A histogram provides a way to display the frequency of occurrences of data along an interval.
Disadvantages of Histograms The use of intervals prevents the calculation of an exact measure of central tendency.
Questions to Ask What is the most frequently occurring interval of values? What is the least frequently occurring interval of values? What conclusions can you draw from the data?
Scatterplot
Consists of
  paired data (bi-variate) displayed on a two-dimensional grid.
Works well when
  multiple measurements are made for each element of a sample.
Additional Notes about Scatterplots
If the relationship is thought to be a causal one, then the independent variable is represented along the x-axis and the dependent variable on the y-axis.
A scatterplot can show that there is a positive, negative, constant, or no relationship (correlation) between the variables.
  Positive: As the value of one variable increases, so does the other.
  Negative: As the value of one variable increases, the other decreases.
  Constant: As the value of one variable increases (or decreases), the other remains constant.
  No relationship: There is no pattern to the points.
Advantages of Scatterplots A scatterplot is one of the best ways to determine if two characteristics are related. A scatterplot may be used when there are multiple trials for the same input variable in an experiment.
Disadvantages of Scatterplots When a scatterplot shows an association between two variables, there is not necessarily a cause and effect relationship. Both variables could be related to some third variable that explains their variation or there could be some other cause. Alternatively, an apparent association could simply be a result of chance.
Questions to Ask Is there a relationship between the variables? If so, what kind? What predictions can you make about the data based on the graph?