Review of descriptive statistics


Published in: Education, Technology

DESCRIPTIVE STATISTICS

1.0 INTRODUCTION

In everyday life, whether at home or at work, we usually keep records or read reports. An item in a record or report is a fact that is expressed as a numerical value or described by its quality or kind. A single item or fact is referred to as a datum; all the facts in a record or report are called data.

Examples of data:
  • Color of the hair
  • Number of students in a class
  • Height and weight
  • Number of times you were absent from class

1.1 Population and Sample

In the data-gathering phase, information is taken from a unit, which is part of a collection of all such units called a population. A population consists of an entire set of objects, observations, or scores that have something in common.

Some definitions:
  • Population: collection of all units from which the data are to be collected
  • Element: a unit in a population
  • Sample: a subset or representative part of the population
  • Frame: a listing of all the elements of the population
  • Census: complete enumeration, in which every member of the population is included
  • Sampling (or sample survey): only a part of the population is used to obtain data

1.2 Definition of Statistics

The word "statistics" is used in several different senses. In the broadest sense, statistics is the branch of science that deals with the development of methods for collecting, organizing, presenting, and analyzing data more effectively. Data and how to deal with them are the main concern of statistics.

In a second usage, a "statistic" is a numerical quantity (such as the mean) calculated from a sample. The grade point average (GPA) is an example of a statistic: it is a value computed from the set of grades of a student in a particular semester.

Illustration: If the data are the set of grades, then the GPA is a statistic. Another numerical value that can be computed from the set of grades is the percentage passing.
The percentage passing is also a statistic. From the same set of grades, the number of subjects that received a failing grade is another statistic. Taken together, the GPA, the percentage passing, and the number of failing grades are called statistics.

Major Areas of Statistics
  1. Descriptive Statistics: deals largely with summary calculations, graphical displays, and describing important features of a set of data. It does not attempt to draw conclusions about anything beyond the data themselves.
  2. Inferential Statistics: concerned with making generalizations from information gathered from a small group of observations (sample) to a bigger group of observations (population).

Two main methods of inferential statistics:
  1. Estimation
     - A sample statistic is used to estimate a population parameter.
     - A confidence interval about the estimate is constructed.
  2. Hypothesis Testing
     - A null hypothesis is put forward.
     - Analysis of the data is then used to determine whether to reject it.

1.3 Variables

A variable is any measured characteristic or attribute that differs for different subjects. When two variables are in a cause-and-effect relationship, the presumed cause is called the independent variable and the presumed effect is called the dependent variable.
Types of Variables:
  1. Qualitative variables (sometimes called "categorical variables")
     - facts for which no numerical measure exists
     - expressed in categories or kinds
     Examples:
     • color of the skin, which can be black, brown, or white
     • a person's sex, which can be male or female
  2. Quantitative variables
     - variables that can be expressed in numbers
     - can be measured or counted
     Examples:
     • a person's height and weight (can be measured)
     • number of students in a class (can be counted)

Classification of Quantitative Variables
  1. Continuous variable: one for which, within the limits over which the variable ranges, any value is possible.
     Examples:
     • Time to solve a math problem is continuous, since it could take 2 minutes, 2.13 minutes, etc. to finish a problem.
     • Height is continuous, since it could be 1.55 meters, 1.65 meters, etc.
  2. Discrete variable: one that cannot take on all values within the limits of the variable.
     Examples:
     • Responses to a five-point rating scale are discrete, since they can only take the values 1, 2, 3, 4, and 5.
     • Number of provinces

1.4 Types of Measurements

  1. Nominal Measurement
     Nominal measurement consists of assigning items to groups or categories. No quantitative information is conveyed and no ordering of the items is implied. Nominal measurements are therefore qualitative rather than quantitative. Nominal measurement is the lowest form of measurement.
     Examples:
     • Color
     • Sex
     • Blood type
     • Religion
  2. Ordinal Measurement
     Measurements on ordinal scales are ordered in the sense that higher numbers represent higher values. However, the intervals between the numbers are not necessarily equal. For example, on a five-point rating scale measuring attitudes towards gun control, the difference between a rating of 2 and a rating of 3 may not represent the same difference as that between a rating of 4 and a rating of 5.
     There is no "true" zero point for ordinal scales, since the zero point is chosen arbitrarily. The lowest point on the rating scale in the example was chosen to be 1. It could just as well have been 0 or 5.
     Examples:
     • Taste preferences
     • Satisfactions
     • Social classes
     • Academic honors
  3. Interval Measurement
     On interval measurement scales, one unit on the scale represents the same magnitude of the trait or characteristic being measured across the whole range of the scale. Interval scales do not have a "true" zero point, however, and therefore it is not possible to make statements about how many times higher one score is than another. A good example of an interval scale is the Fahrenheit scale for temperature.
  4. Ratio Measurement
     Ratio measurements are like interval measurements except that they have a true zero point. Ratio measurement is the highest form of measurement.
     Examples:
     • Length
     • Weight

Note: A large number of statistical analysis tools are available for each type of measurement. The user should understand the type of data to be processed so that the statistical tool chosen is used properly.

1.5 Random and Non-Random Sampling

  • Random sampling is the most commonly used sampling technique. Each member of the population is given an equal chance of being selected for the sample.
  • Non-random sampling is a method of collecting a small portion of the population in which not all members of the population are given a chance to be included in the sample.

Properties of Random Sampling
  1. Equiprobability: each member of the population has an equal chance of being selected and included in the sample.
  2. Independence: the chance of one member being drawn does not affect the chance of any other member.

1.6 Probability Sampling Techniques

  1. Simple Random Sampling (SRS): a process for selecting a sample in which every element of the sampled population is given an equal chance of being included in the sample.
  2. Systematic Random Sampling: every kth unit is included in the sample after a random start is taken.
  3. Stratified Proportional Random Sampling: the population is divided into homogeneous groups, or strata, and selection is done within each stratum.
  4. Multi-stage Sampling: this technique uses several stages or phases in getting a sample from the population. This method is an extension or a multiple application of the stratified random sampling technique.

1.7 Non-random Sampling Techniques

  1. Judgment or Purposive Sampling: this method is also referred to as non-probability sampling. The researcher's judgment plays a major role in the selection of a particular item and in making decisions in cases of incomplete responses or observations.
  2. Quota Sampling: this is a relatively quick and inexpensive method to operate, since the choice of the number of subjects to be included in a sample is made at the researcher's own convenience or preference and is not predetermined by some carefully operated randomizing plan.
  3. Cluster Sampling: the population is divided into a number of relatively small subdivisions, which are themselves clusters of still smaller units, and then some of these subdivisions, or clusters, are randomly selected for inclusion in the overall sample. (Because the clusters are selected at random, cluster sampling is usually classified as a probability technique.)
  4. Incidental Sampling: this design is applied to samples that are taken because they are the most available.
  5. Convenience Sampling: this method has been widely used in television and radio programs to find out the opinions of viewers and listeners regarding a controversial issue.

1.8 Methods of Collecting Data

There are many ways of collecting data, each of which has its own advantages and disadvantages. The more general methods of collecting information are:

  1. Direct or Interview Method
     A very common and effective method of obtaining information is conducting interviews. People usually respond when visited in person.
     Disadvantages: People may tend to lie, and interviews are quite costly and need thorough training of the interviewers (untrained interviewers tend to influence the respondents' answers).
  2. Indirect or Questionnaire Method
     Questionnaires can either be mailed or handed personally to respondents.
     Advantages: It does not require interviews and is therefore less costly. It also covers a wider area than interviews.
     Disadvantages: The response rate is usually lower than for interviews. Many people tend to ignore mailed questionnaires. To encourage participation, a questionnaire should be kept as short as possible and contain questions related to the objectives of the survey.
  3. Direct Observation
     In situations where less personal responses are needed, collecting data by direct observation may be used.
     Disadvantage: The person assigned to observe may commit some observational errors.
  4. Experimentation
     Experimentation is used when the objective is to determine the cause and effect of a certain phenomenon under controlled conditions.
  5. Utilizing Existing Records
     A very convenient way of obtaining data is by utilizing existing records. There are a number of institutions that gather data not only for their own purposes but also for the purposes of other groups of people.
     Advantage: It is very economical and requires less cooperation from people.
     Disadvantage: The information needed may not be found in these sources.

Data are sometimes obtained from published or unpublished documents, which can be classified as follows:
  • Primary sources: provide data first hand; data gathered originally have not been subjected to any transcription or condensation. Their authenticity is guaranteed by the group that gathered them originally.
  • Secondary sources: provide data that have been transcribed or compiled from original sources.

2.0 ORGANIZATION AND PRESENTATION OF DATA

After data have been gathered and checked for possible errors, the next logical step is to present the data in a manner that is easy to understand. The presentation should also readily convey the relevant information and the important results at a glance.

Ways/Methods of presenting data:
  1. Textual presentation: a narrative description of the characteristics of the population, based on the data collected and organized
  2. Tabular presentation: data are tallied into the appropriate row and/or column categories
  3. Graphical presentation: data are presented graphically, such as in a bar chart, histogram, pie chart, or pictograph

2.1 Textual Presentation

Example: A total of 22.4 million children aged 5-17 years old in 9.6 million households were estimated from the 1995 National Survey of Working Children (NSWC). Sixteen percent (16%), or 3.6 million children, were reported engaged in economic activities at any time in 1995. Boys were more likely to work than girls, with a national sex ratio of working children of 187.

2.2 Tabular Presentation

A tabular presentation may be in the form of a cross tabulation table, a frequency distribution table (FDT), or a stem-and-leaf plot.

2.2.1 Cross Tabulation Table

When data are in categories, results are usually presented in a systematic manner by using a table, which arranges data in rows and columns.
Example:

Table 1. Numbers of Subjects Falling Into Smoking/Lung Cancer Combinations

                 Lung Cancer
  Smoker     Present   Absent   Total
  Yes            688      650    1338
  No              21       59      80
  Total          709      709    1418

A table contains:
  1. Heading
     The heading includes a table number and a title. A table number is necessary to easily identify the table. It should be followed by a title, which briefly describes the contents of the table.
  2. Body
     The body is the main part of the table. It contains the row categories (found on the left side of the table) and the column categories (found at the top of the table). Row totals may also be included and are located on the right side of the table. Column totals may also be included and are located at the bottom of the table. The figures found in the cells of the main body are usually frequencies, representing the number of times the two categories occur together. Percentages can be used instead of frequencies, or both can be shown.
  3. Footnote (optional)
     The data used may have been taken from some publication or provided by another group of persons. Footnotes may be added to indicate the source of the information.

Contingency Table: a table listing the frequencies for the different combinations of values of two categorical variables.

2.2.2 Frequency Distribution

In many instances, the information gathered is numerical in nature, such as the age of a respondent or the exam score of a student. When faced with a large set of this kind of data, it is often advantageous to group the data into a number of classes or intervals so as to get a better overall picture.

Table 2.3 Scores in a Statistics Final Exam

  31  28  15  10  47  18  32  29  58  48
  37  49  26  54  56  21  24  28  32  28
  43  12  23  29  61  16  42  40  32  26
  48  36  39  22  40  20  63  54  30  17
  18  30  23  26  36  47  19  25  38  35

Table 2.3 is a set of scores in a Statistics exam. These data will be used to illustrate the construction of a frequency table.
Frequency distribution: a grouping of all observations into intervals or classes, together with a count of the number of observations that fall in each interval or class.

The data in Table 2.3 are called raw data, and in that form they are difficult to read and analyze. In a frequency distribution the data are presented in a more compact and usable manner. However, this process brings about some loss of detail.

1.1 Steps in Constructing a Frequency Distribution

  1. From the data set, identify the highest value and the lowest value. Compute the range R as
       R = highest value - lowest value
  2. Estimate the number of classes k as
       $k = \sqrt{n}$
     Note: The result is rounded up to the next higher integer, NOT to the usual nearest integer. Rounding to the nearest integer will often yield a number of intervals that cannot accommodate all the observations.
  3. Estimate the width c of the interval by dividing the range R by the number of classes k. That is,
       $c = \frac{R}{k}$
     Round off this estimate to the same number of decimal places as the original data set, that is, to the precision of the raw data:

       No. of decimal places of the raw data   Precision
       0                                       1
       1                                       0.1
       2                                       0.01
       3                                       0.001

  4. List the lower and upper class limits of the first interval. This interval should contain the smallest observation in the data set. The starting lower limit could be the lowest value or any number close to it.
  5. List all the class limits by adding the class width to the limits of the previous interval. The highest class should contain the largest observation in the data set.
  6. Tally the frequencies for each class.
  7. Compute the class marks and the class boundaries.
     The class midpoint, or class mark, is the midpoint of an interval. That is,
       $CM = \frac{LL + UL}{2}$
     where CM is the class mark, LL is the lower limit, and UL is the upper limit.
     To find the class boundaries, it is important to know the unit of accuracy of the raw data. The final exam scores are accurate to the ones unit. A value reported as 5.8 kg is accurate to the tenths unit, while a GPA of 2.64 is accurate to the hundredths unit.
     The lower class boundary, $L_i$, is given as
       $L_i = LL - 0.5 \times \text{Precision}$
     The upper class boundary, $U_i$, is given as
       $U_i = UL + 0.5 \times \text{Precision}$

Additional columns may be added to obtain additional information about the distributional characteristics of the data. Among these are:
  a) Relative Frequency (RF): the frequency of a class expressed as a proportion or percentage of the total number of observations. That is,
       $RF = \frac{f_i}{n}$
     where $f_i$ is the frequency in each interval.
  b) Cumulative Frequency (CF): the accumulated frequency of a class.
     There are two types:
     • The "less than" CF (<CF) of a class is the number of observations whose values are less than or equal to the upper limit of the class.
     • The "greater than" CF (>CF) of a class is the number of observations whose values are greater than or equal to the lower limit of the class.
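The construction steps above can be sketched in Python for the exam scores of Table 2.3. This is only an illustration under stated assumptions: the data are whole numbers (precision 1), the number of classes follows the square-root rule, and the first lower limit is taken to be the lowest score. The function name `frequency_distribution` and the dictionary keys are hypothetical choices, not part of the original notes.

```python
import math

# Exam scores from Table 2.3 (50 observations).
scores = [
    31, 28, 15, 10, 47, 18, 32, 29, 58, 48,
    37, 49, 26, 54, 56, 21, 24, 28, 32, 28,
    43, 12, 23, 29, 61, 16, 42, 40, 32, 26,
    48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
    18, 30, 23, 26, 36, 47, 19, 25, 38, 35,
]


def frequency_distribution(data):
    """Build a frequency table for whole-number data (precision = 1)."""
    n = len(data)
    low, high = min(data), max(data)
    data_range = high - low                  # Step 1: R = highest - lowest
    k = math.ceil(math.sqrt(n))              # Step 2: k = sqrt(n), rounded up
    width = max(1, round(data_range / k))    # Step 3: c = R / k, rounded to precision 1

    rows = []
    lower = low                              # Step 4: start at the lowest value (assumption)
    cumulative = 0
    while True:
        upper = lower + width - 1            # upper class limit when precision is 1
        freq = sum(lower <= x <= upper for x in data)    # Step 6: tally
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            "f": freq,
            "class_mark": (lower + upper) / 2,           # Step 7: CM = (LL + UL) / 2
            "boundaries": (lower - 0.5, upper + 0.5),    # LL - 0.5, UL + 0.5
            "rf": freq / n,                              # relative frequency
            "less_cf": cumulative,                       # observations <= upper limit
            "greater_cf": n - (cumulative - freq),       # observations >= lower limit
        })
        if upper >= high:                    # Step 5: stop once the largest value is covered
            break
        lower += width
    return rows


table = frequency_distribution(scores)
for row in table:
    print(row["limits"], row["f"], row["class_mark"], row["less_cf"], row["greater_cf"])
```

For these scores the sketch gives R = 53, k = 8, and c = 7, so the first class is 10-16 and the last is 59-65. A different starting lower limit or class-count rule would give a different, equally valid table.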
2.3 Graphical Presentation

This form is often the most effective means of organizing and presenting data, because important relationships are brought out clearly and creatively in solid, colorful figures.

2.3.1 Different Kinds of Graphs/Charts

  1. Line Graph: shows the relationship between two sets of quantities. The X set of quantities is plotted along the horizontal axis against the Y set of quantities along the vertical axis of a Cartesian coordinate plane. The plotted points are then connected by line segments, which form the line graph.
  2. Bar Graph: consists of bars or rectangles of equal width, drawn either vertically or horizontally.
  3. Circle Graph or Pie Chart: represents the relationships among the different components of a single total, shown as sectors of a circle.
  4. Picture Graph or Pictogram: a visual presentation of statistical quantities by means of pictures or symbols related to the subject under study.

2.3.2 Graphical Representation of the Frequency Distribution

  1. Bar Chart and Histogram
     The bar chart is one of the more popular ways of representing a frequency distribution graphically. The different classes are represented on the horizontal axis by their class limits (or by categories, for nominal data). The height of each rectangle, measured on the vertical axis, represents the class frequency. A graph that closely resembles the bar chart is the histogram. The basic difference is that a bar chart uses class limits on the horizontal axis, while the histogram uses class boundaries. Using class boundaries eliminates the spaces between the rectangles, giving the histogram a solid appearance.
  2. Frequency Polygon
     A frequency polygon is constructed by plotting the class marks against the frequencies. The (x, y) points formed by the class marks and their corresponding frequencies are connected by straight lines.
     To complete the polygon, which is defined as a closed figure, an additional class mark with zero frequency is added at the beginning and at the end of the distribution.
  3. Frequency Ogive
     A cumulative frequency distribution can be represented graphically by a frequency ogive. An ogive is obtained by plotting the upper class boundaries on the horizontal scale and the cumulative frequency less than each upper class boundary on the vertical scale.

3.0 NUMERICAL DESCRIPTION OF DATA

A numerical descriptive measure is a value that summarizes a set of observations in a single number, and that value may be used to represent the entire set of data.

3.1 The Summation Symbol

The Greek letter $\Sigma$ (upper case sigma) denotes the summation symbol. It is a more compact way of writing the sum of a set of data values. A convenient way of writing a data value in mathematical notation is the subscripted variable $x_i$, which is read as "x sub i". When a set of data values is written in subscripted notation as $x_1, x_2, x_3, \ldots, x_n$, the notation $\sum_{i=1}^{n} x_i$ is defined as

  $\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$

The symbol $\sum_{i=1}^{n} x_i$ is read as "the summation of x sub i from 1 to n".

Example: Consider the set of data values 5, 4, 8, and 6, which are measurements of weights. Find the following:
  1. $\sum_{i=1}^{4} x_i$
  2. $\sum_{i=1}^{4} x_i^2$
  3. $\left( \sum_{i=1}^{4} x_i \right)^2$

3.2 Measures of Central Tendency

A measure of central tendency is a single value about which the set of observations tends to cluster.
3.2.1 ARITHMETIC MEAN

The arithmetic mean, or simply the mean, is the sum of a set of measurements divided by the number of measurements in the set. This measure is appropriate for data on the interval or ratio scale.

  a. Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
  b. Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
  c. Weighted mean: $\bar{x}_w = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
  d. Grand mean: $\bar{\bar{x}} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$

Examples 3.2.1:
  1. The numbers of hours spent studying per day by ten students were recorded as follows: 5, 8, 2, 2, 2, 6, 5, 3, 1, and 4. Find the mean.
  2. The following table shows the number of households in five (5) barangays in Iligan City in 2010, and the corresponding percentage changes in the number of households from 2010 to 2012.

       Barangay     Number of Households   Percentage Change
       Tibanga                    11,802                 9.1
       Suarez                      8,624                 8.3
       Hinaplanon                  5,326                 4.5
       Digkilaan                     894                 1.4
       Palao                      12,012                10.6

     Find the weighted mean of the percentage changes.

3.2.2 MEDIAN

The median is the middle value of a set of observations arranged in increasing or decreasing order of magnitude. It is the middle value when the number of observations is odd, and the average of the two middle values when the number is even; i.e., it is the value such that half of the observations fall above it and half below it. The median is not affected by the presence of abnormally large or abnormally small observations.

  a. Population median:
       $\tilde{\mu} = x_{(N+1)/2}$ if $N$ is odd
       $\tilde{\mu} = \frac{1}{2} \left( x_{N/2} + x_{N/2 + 1} \right)$ if $N$ is even
  b. Sample median:
       $\tilde{x} = x_{(n+1)/2}$ if $n$ is odd
       $\tilde{x} = \frac{1}{2} \left( x_{n/2} + x_{n/2 + 1} \right)$ if $n$ is even
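The mean, weighted mean, and median formulas above can be sketched in Python using the data of Examples 3.2.1. The function names are illustrative, and the use of the household counts as the weights $f_i$ in the second example is an assumption about what the exercise intends.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(f_i * x_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(values, weights)) / sum(weights)


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # x_((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of x_(n/2) and x_(n/2 + 1)


# Example 1: hours spent studying per day by ten students.
hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]
print(mean(hours))     # 3.8
print(median(hours))   # 3.5

# Example 2: percentage changes weighted by number of households (assumed weights).
households = [11802, 8624, 5326, 894, 12012]
pct_change = [9.1, 8.3, 4.5, 1.4, 10.6]
print(round(weighted_mean(pct_change, households), 2))   # 8.58
```

The weighted mean (about 8.58) is higher than the simple average of the five percentages (6.78) because the barangays with the most households also have the largest percentage changes.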
3.2.3 MODE

The mode is the value that occurs the greatest number of times, i.e., the value with the greatest frequency.

Remarks 3.2.1
  1. When a data set is normally distributed, its mean, median, and mode are equal. (Equality of the three by itself does not guarantee normality; it only suggests a symmetric, unimodal distribution.)
  2. The graph of normally distributed data is a symmetrical bell-shaped curve.

3.3 Measures of Variability or Dispersion

Measures of variability are numerical values computed from the given observations that measure how the data spread out from the central location.

3.3.1 RANGE

The range is the difference between the largest and the smallest values in the set. It is denoted by R, i.e.,
  R = Highest Value - Lowest Value

3.3.2 VARIANCE

The variance is the average of the squared differences of the scores from the mean score of a distribution.

  a. Population variance. Given the finite population $x_1, x_2, \ldots, x_N$, the population variance is
       $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
     For ease of computation, an alternative form is
       $\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$
  b. Sample variance. Given the random sample $x_1, x_2, \ldots, x_n$, the sample variance is
       $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
     A computationally faster form is
       $s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$

Note that the denominator of the sample variance involves n - 1. Dividing by n instead would, on average, underestimate the population variance and so produce a biased estimate.

3.3.3 STANDARD DEVIATION

The standard deviation is the positive square root of the variance.
  a. Population standard deviation: $\sigma = \sqrt{\sigma^2}$
  b. Sample standard deviation: $s = \sqrt{s^2}$

3.3.4 COEFFICIENT OF VARIATION (denoted by CV)

The coefficient of variation is a measure of relative variation expressed as a percentage. It is the ratio of the standard deviation to the mean, multiplied by 100%.
  a. $CV = \frac{\sigma}{\mu} \times 100\%$
  b. $CV = \frac{s}{\bar{x}} \times 100\%$

Examples 3.3.4
  1. The final examination given to two sections of Math 2 gave the following means and standard deviations:

       Statistics           Section A   Section B
       Mean                        30          46
       Standard Deviation          10          12

     Find the coefficient of variation of each section and determine which of the two sections has the greater variability of scores.
  2. The mean height of college women is 157.48 cm with a standard deviation of 6.35 cm, while their mean weight is 47.70 kg with a standard deviation of 3.64 kg. Which is more variable, the height or the weight of the college women?

3.3.5 Characteristics of the Standard Deviation

The standard deviation and the variance are the most commonly used measures of dispersion in the social sciences because:
  1. Both take into account the precise difference between each score and the mean.
  2. If any single score is changed, the standard deviation changes. If the score is moved away from the mean, the standard deviation increases; if it is moved toward the mean, the standard deviation decreases.
  3. If a score is added that is far from the mean, the standard deviation increases; if the added score is close to the mean, the standard deviation decreases.

3.3.6 Interpreting the Standard Deviation

The standard deviation is very important regardless of the mean. It makes a great deal of difference whether the distribution is spread out over a broad range or bunched up closely around the mean. Figure 3.1 shows a set of scores that are normally distributed.

Figure 3.1 A Normal Curve Showing the Percent of Cases Lying Within 1, 2, and 3 Standard Deviations From the Mean

Chebyshev's Theorem

The position of the scores in a frequency distribution relative to the mean can be bounded using Chebyshev's Theorem.

Chebyshev's Theorem: The proportion of any data set that lies within k standard deviations of the mean (where k is any number greater than 1) is at least
  $1 - \frac{1}{k^2}$

For example, for any data set, at least 88.9% of the data lie within three standard deviations on either side of its mean.

Example: If the mean score of the students enrolled in a Statistics class is 66 points with a standard deviation of 5 points, at least what percentage of the scores must lie between 46 and 86?
Solution:
  $\bar{x} - ks = 46$
  $66 - 5k = 46$
  $5k = 66 - 46$
  $k = 4$

Hence, from Chebyshev's Theorem,
  $1 - \frac{1}{k^2} = 1 - \frac{1}{4^2} = \frac{15}{16} = 93.75\%$
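The dispersion measures of Section 3.3 and the Chebyshev bound can be checked with a short Python sketch. The function names are illustrative only. The four weights 5, 4, 8, and 6 are borrowed from the summation example of Section 3.1 simply to have concrete data; the coefficient-of-variation and Chebyshev figures come from the examples above.

```python
import math


def population_variance(values):
    """sigma^2 = sum((x_i - mu)^2) / N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """s^2 = sum((x_i - xbar)^2) / (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    return (n * sum(x * x for x in values) - sum(values) ** 2) / (n * (n - 1))


def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) * 100, in percent."""
    return std_dev / mean * 100


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k ** 2


weights = [5, 4, 8, 6]   # data values from the summation example in Section 3.1
print(population_variance(weights))            # 2.1875
print(math.sqrt(sample_variance(weights)))     # sample standard deviation

# Examples 3.3.4, item 1: Section A (mean 30, SD 10) vs. Section B (mean 46, SD 12).
print(round(coefficient_of_variation(10, 30), 2))   # 33.33
print(round(coefficient_of_variation(12, 46), 2))   # 26.09

# Chebyshev example: mean 66, SD 5, interval from 46 to 86.
k = (86 - 66) / 5
print(chebyshev_bound(k))                      # 0.9375
```

Section A has the larger coefficient of variation, so its scores are relatively more variable even though Section B has the larger standard deviation.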
3.4 Other Measures of Location (Quantiles or Fractiles)

The measures of central tendency refer only to the center of the entire set of data, but there are other measures of location that describe or locate non-central positions in the data. These measures are referred to as quantiles or fractiles. In this section, we will consider the fractiles, which can be percentiles, deciles, or quartiles.

3.4.1 Percentiles: values that divide an ordered set of observations into 100 equal parts. These values, denoted by $P_1, P_2, \ldots, P_{99}$, are such that 1% of the data falls below $P_1$, 2% falls below $P_2$, ..., and 99% falls below $P_{99}$.

3.4.2 Deciles: values that divide an ordered set of observations into 10 equal parts. These values, denoted by $D_1, D_2, \ldots, D_9$, are such that 10% of the data falls below $D_1$, 20% falls below $D_2$, ..., and 90% falls below $D_9$.

3.4.3 Quartiles: values that divide an ordered set of observations into 4 equal parts. These values, denoted by $Q_1$, $Q_2$, and $Q_3$, are such that 25% of the data falls below $Q_1$, 50% falls below $Q_2$, and 75% falls below $Q_3$.

Procedure for the computation of the fractiles:
  1. Arrange the data in increasing order of magnitude.
  2. Solve for the value of L, where
       $L = \frac{nm}{100}$ for percentiles
       $L = \frac{nm}{10}$ for deciles
       $L = \frac{nm}{4}$ for quartiles
     where m is the location of the percentile, decile, or quartile (for example, m = 3 for $Q_3$) and n is the number of observations.
  3. If L is an integer, the desired fractile is the average of the Lth and the (L + 1)th observations. If L is fractional, take the next higher integer as the required location; the fractile is the value in that location.

Remark 3.4:
  1. The interquartile range is the distance on the scale between $Q_1$ and $Q_3$, that is, $Q_3 - Q_1$.
  2. The quartile deviation, also called the semi-interquartile range, is half of the interquartile range.

3.5 Skewness and Kurtosis

Skewness is the degree of departure from symmetry of a distribution.
Kurtosis is the degree of peakedness of a distribution.

3.5.1 Symmetric Distributions (those where one side is the mirror image of the other) have a mean and a median with the same value; the normal curve is the most familiar example. If the distribution is symmetric and unimodal, the mode also has the same value as the mean and the median (see Graph 1 in Figure 4.1).

3.5.2 Skewed Distributions have different values for the mean, median, and mode. For unimodal skewed distributions, the mean is typically pulled toward the tail, and the median lies between the mean and the mode.

Figure 4.1 Graphs of Different Types of Distribution
Remarks 3.4
  1. A positively skewed distribution has a "tail" that is pulled in the positive direction (see Graph 3 in Figure 4.1).
  2. A negatively skewed distribution has a "tail" that is pulled in the negative direction (see Graph 2 in Figure 4.1).
  3. A symmetric distribution has zero skewness.
  4. A normal distribution is a mesokurtic distribution.
  5. A pure leptokurtic distribution has a higher peak than the normal distribution and heavier tails.
  6. A pure platykurtic distribution has a lower peak than the normal distribution and lighter tails.

3.5.3 Application of Measuring Skewness and Kurtosis

One application is testing for normality: many statistical inferences require that a distribution be normal or nearly normal. A normal distribution has skewness and excess kurtosis of 0, so if your distribution is close to those values then it is probably close to normal.

3.5.4 Calculating Skewness

The moment coefficient of skewness of a data set is
  $g_1 = \frac{m_3}{m_2^{3/2}}$
where
  $m_3 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n}$
  $\bar{x}$ is the mean and n is the sample size, as usual;
  $m_3$ is called the third moment of the data set;
  $m_2$ is the variance (computed with divisor n).

Note: Remember that you have to choose one of two different measures of standard deviation, depending on whether you have data for the whole population or just a sample. The same is true of skewness. If you have the whole population, then $g_1$ above is the measure of skewness. But if you have just a sample, you need the sample skewness:
  $G_1 = \frac{\sqrt{n(n - 1)}}{n - 2} \, g_1$

3.5.5 Interpreting Skewness

  1. If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left.
  2. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer.
  3. If skewness = 0, the data are perfectly symmetrical.
  4. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979) suggests this rule of thumb:
     a. If skewness is less than -1 or greater than +1, the distribution is highly skewed.
     b. If skewness is between -1 and -1/2 or between +1/2 and +1, the distribution is moderately skewed.
     c. If skewness is between -1/2 and +1/2, the distribution is approximately symmetric.

Inferring

Your data set is just one sample drawn from a population. Perhaps, from ordinary sampling variability, your sample is skewed even though the population is symmetric. But if the sample is skewed too much for random chance to be the explanation, then you can conclude that there is skewness in the population. To decide, divide the sample skewness $G_1$ by the standard error of skewness (SES) to get the test statistic, which measures how many standard errors separate the sample skewness from zero:
test statistic:

$$Z_{g_1} = \frac{G_1}{SES}, \qquad SES = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}$$

The critical value of Zg1 is approximately 2. (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level.)
• If Zg1 < −2, the population is very likely skewed negatively (though you don’t know by how much).
• If Zg1 is between −2 and +2, you can’t reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
• If Zg1 > 2, the population is very likely skewed positively (though you don’t know by how much).
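The calculation in 3.5.4 and the test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name skewness_test and the five-value sample are hypothetical and do not come from the handout, and m2 is computed with the divisor n, as for m3.

```python
import math

def skewness_test(data):
    """Return (g1, G1, SES, Z) for a list of numbers.

    g1  = m3 / m2**1.5                         (moment coefficient of skewness)
    G1  = sqrt(n(n-1)) / (n-2) * g1            (sample skewness)
    SES = sqrt(6n(n-1) / ((n-2)(n+1)(n+3)))    (standard error of skewness)
    Z   = G1 / SES                             (test statistic)
    """
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    # Second and third moments about the mean, both with divisor n.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        raise ValueError("all observations are equal; skewness is undefined")
    g1 = m3 / m2 ** 1.5
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1
    ses = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, G1, ses, G1 / ses

# Hypothetical sample, chosen only to illustrate the steps.
sample = [1, 1, 1, 1, 10]
g1, G1, ses, z = skewness_test(sample)
print(f"g1 = {g1:.3f}, G1 = {G1:.3f}, SES = {ses:.3f}, Z = {z:.3f}")
if z > 2:
    print("population is very likely positively skewed")
elif z < -2:
    print("population is very likely negatively skewed")
else:
    print("no conclusion about population skewness")
```

For this sample g1 works out to 1.5 and Z to about 2.45, so the rule above points to positive skewness. With only five observations, such a result should be read with caution.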
CASE STUDIES:

Case Study 1
1. A study was conducted to see how well reading success in first grade could be predicted from various kinds of information obtained in kindergarten: age, sex, tribe, academic rank, and IQ. Which of the variables represents a
   a. nominal scale
   b. ordinal scale
   c. interval scale
   d. ratio scale
2. Are the following variables discrete or continuous?
   a. The number of correct answers on a true-false test.
   b. The duration of the effectiveness of a pain medication.
   c. The number of commercials aired daily by a television station.
   d. The weights of Sunday newspapers.
   e. The heights of basketball players.
3. Among 250 employees of the local office of an international insurance company, 182 are whites, 51 are blacks, and 17 are Orientals. If we use stratified random sampling to select a committee of 15 employees, how many employees must we take from each class?
4. Suppose you were asked to make a study on the brand preferences and satisfaction of the customers of famous laundry soaps in four (4) different supermarkets.
   a. Arrange the letters of the following steps of statistical inquiry in logical order.
      A. Collecting relevant information
      B. Defining a problem
      C. Interpreting the data
      D. Analyzing the data
      E. Organizing and presenting data
   b. Who will be the most appropriate respondents of the study?
   c. How will you apply multi-stage sampling to the population of the study?
   d. Calculate the sample size if the population size is 2000 and the margin of error is 5%.

Case Study 2
1. Create a textual presentation based on the table shown below. Suppose there are 800 million users per day.
2. Create tabular and (any) graphical presentations of the textual presentation given below.
“The top three regions in terms of population count are Region IV – Southern Tagalog (11.32 million or 15.04% of the total), NCR (10.49 million or 13.93%), and Region III – Central Luzon (7.80 million or 10.35%). The population residing in these regions combined comprises 39.32% of the total Filipino population. This means that four out of ten persons in the country reside in NCR and the adjoining regions of Central Luzon and Southern Tagalog.”
3. Using the table below:

Table 2.5 Number of Passengers for P&P Airlines
68   72   50   70   65   83   77   78   80   93
71   74   60   84   72   84   73   81   84   92
77   57   70   59   85   74   78   79   91  102
83   67   66   75   79   82   93   90  101   80
79   69   76   94   71   97   92   83   86   69

   a. Construct a frequency distribution table (with the class intervals, frequencies, class boundaries, class marks, and cumulative frequencies) for the given data.
   b. Construct its bar graph, histogram, frequency polygon, and frequency ogive.
   c. Determine whether the given data set is normally distributed.
4. Given the frequency polygon below:

[Figure: frequency polygon. Axis labels: FREQUENCY and CLASS MARKS. Extracted tick values: 6, 10, 12, 14; 21.2, 22.9, 24.6, 26.3, 28, 29.7, 31.1, 34.8, 36.5; 0.]

   a. Reconstruct the frequency distribution table.
   b. Construct the frequency histogram.
   c. Answer the following:
      i. What is the lower class limit of the lowest class?
      ii. What is the lower class boundary of the highest class?
      iii. What is the class width?

Case Study 3
1. A random sample of 10 students was given a special test. The times (in minutes) the students took to finish the exam are as follows:

15   30   26   40   35   19   22   28   17   38

Find the following:
   a) Mean
   b) Median
   c) Variance
   d) Standard Deviation
   e) Range
   f) Mode
   g) Coefficient of Variation
   h) 18th Percentile
   i) 7th Decile
   j) 3rd Quartile
2. Suppose that you are investigating the influence of an interactive approach on students’ mathematics performance. Consider the following samples of students’ final examination scores taken from three (3) sections of Math 1 enrolled during the first semester of SY 2011 – 2012.

Sections     Sample Scores
Rizal        19    8    7    2   19   29   36   20    3   14
Bonifacio    14   25   12   32   13   17   10   22   13   32
Luna         24   13   20    1    8   28   16   21   23   26

   a. Describe the performance of each section using its mean and standard deviation.
   b. Which of these three sections showed the greatest improvement in the students’ performance in mathematics? Explain why.
3. The table below shows the distribution of your respondents’ responses in the emotional intelligence inventory.

Emotional Intelligence Inventory
Scale: (1) Almost Never, (2) Seldom, (3) Sometimes, (4) Usually, (5) Almost Always

Indicators                                                                     (1)  (2)  (3)  (4)  (5)
1. I appropriately communicate decisions to stakeholders.                       11    9   15    5    9
2. I fail to recognize how my feelings drive my behavior at work.               18    2   10   12    8
3. When upset at work, I still think clearly.                                    5    6   15   14    8
4. I fail to handle stressful situations at work effectively.                   10   12    8   14    6
5. I understand the things that make people feel optimistic at work.            18    2   13    7   10
6. I fail to keep calm in difficult situations at work.                         21   12    8    9    0
7. I am effective in helping others feel positive at work.                       1    4   16   19   10
8. I find it difficult to identify the things that motivate people at work.     15   12    5    8    5

   a. Find the weighted mean of each statement.
   b. Set up a Likert scale with 5 intervals to interpret the results by assigning a descriptive equivalent such as “very low”, “low”, “average”, “high”, “very high”.
   c. Find the standard deviation of each item.
   d. Find the grand mean.
   e. Interpret the results.
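As a hedged illustration of the weighted mean asked for above, the sketch below scores one row of the inventory with the scale values 1 to 5 as weights and the response counts as frequencies. The function names are hypothetical, and the standard deviation uses the population form (divisor equal to the number of responses), which is an assumption; the handout does not say which form is intended.

```python
import math

# Scale values (1) Almost Never ... (5) Almost Always, used as weights.
SCALE = (1, 2, 3, 4, 5)

def weighted_mean(freqs, weights=SCALE):
    """Weighted mean of scale values, weighted by response counts."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("no responses")
    return sum(w * f for w, f in zip(weights, freqs)) / total

def weighted_sd(freqs, weights=SCALE):
    """Standard deviation of the responses (assumption: divisor = total count)."""
    mean = weighted_mean(freqs, weights)
    total = sum(freqs)
    variance = sum(f * (w - mean) ** 2 for w, f in zip(weights, freqs)) / total
    return math.sqrt(variance)

# Response counts for indicator 1, copied from the table above.
item1 = [11, 9, 15, 5, 9]
print(round(weighted_mean(item1), 2))  # 139 / 49, prints 2.84
print(round(weighted_sd(item1), 2))
```

The same two functions can be applied row by row to the remaining indicators; how the resulting means map onto the descriptive equivalents depends on the Likert intervals chosen.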