# Project examples for sampling and the law of large numbers

A whitepaper on the way sampling can save money for projects and project managers.

Published in: Business, Economy & Finance

1. Examples for the Project Manager: Sampling and the Law of Large Numbers

   The Law of Large Numbers (LLN) tells us it's possible to estimate certain information about a population from just the data measured, calculated, or observed from a sample of the population. Sampling saves the project manager time and money, but introduces risk. How much risk for how much savings? Answering that question is the subject of this paper.

   A whitepaper by John C. Goodpasture, PMP, Managing Principal, Square Peg Consulting, LLC
2. Sampling and the Law of Large Numbers: Examples for the Project Manager

   The Law of Large Numbers (LLN) tells us it's possible to estimate certain information about a population from just the data measured, calculated, or observed from a sample of the population. A population is any frame of like entities. For statistical purposes, entities should be individually independent and subject to identical distributions of values of interest.

   Sampling saves project managers a lot of time and money:

   - It obtains practical and useful results even when it is not economical to obtain and evaluate every data point in a population.
   - It extends the project's reach even when it is not practical to reach every member of the population.
   - It provides actionable information even when it is not possible to know every member of the population.
   - It avoids spending too much time observing, measuring, or interviewing every member of the population.
   - It avoids collecting more data than can be handled, even if every member of the population were readily available; this includes the expense and timeliness of data handling.

   Analysis by sampling is called 'drawing an inference', and the branch of statistics from which it comes is called 'inferential statistics'. Drawing an inference is similar to 'inductive reasoning'. In both cases, inference and induction, one works from a set of specific observations back to the more general case, or to the rules that govern the observations.

   #### What about risks?

   Sampling introduces risk into the project:

   - Risk that the data sample may not accurately portray the population: there may be inadvertent exclusions, clusters, strata, or other population attributes not understood and accounted for.

   ©Copyright 2010 John C Goodpasture
3. - Risk that some required information in the population may not be sampled at all; thus the sample data may be deficient or may misrepresent the true condition of the population.
   - Risk that, in other situations, the data in the sample are outliers and misrepresent their true relationship to the population; the sample may not be discarded when it should be.

   #### Risk assessments

   There are two risk assessments to be made. Examples in this paper will illustrate these two assessments.

   1. 'Margin of error', which refers to the estimated error around the measurement, observation, or calculation of statistics within the interval of the sample data, and
   2. 'Confidence interval', which refers to the probability that true population parameters are within the range of the interval.

   Margin of error is the percentage of the interval relative to the statistic being estimated, where 'Interval' is the half-width of the interval, as in the appendix formulas:

   $$\%\ \text{error} = \frac{\text{Interval}}{\text{Average}} \times 100 \quad \text{or} \quad \frac{\text{Interval}}{\text{Proportion}} \times 100$$

   Because margin of error is a ratio, the risk manager has to be concerned with both the numerator and the denominator: for small statistical values [a small denominator] the interval [the numerator] must likewise be small, and a small interval is achieved by having a large sample size, N.

   Confidence intervals have their own risk. The principal risk is that the sample misrepresents the population. If confidence is stated as 95% for some interval, then there is a 5% chance that the true population parameter lies outside the interval. Consider this case: a population with a parameter whose real value is 8 is sampled [of course, this fact, the real value of 8, is unknown to the project team]. Also unknown to the project team, the sample is influenced by some outliers in the population, and the sample average is calculated from the sample data to be 10. The sample average is accepted as true although it is false.

   #### Sample design and sample risk

   Would more samples of the same size improve the chances of getting it right? Perhaps. However, the definition of confidence covers the case: of all the sample intervals obtained, 95% of them will contain the true population parameter; or, for only one interval obtained, there is a 95% chance that the true population parameter is within the interval.
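The statement that 95% of all sample intervals contain the true population parameter can be checked by simulation. The following is a minimal Python sketch, not part of the original paper: it assumes a normally distributed population with the true value of 8 from the example above, and the standard deviation of 2, sample size of 50, and 2,000 repeated samples are hypothetical choices made only for illustration.

```python
import random
import statistics

def coverage(true_mean=8.0, sigma=2.0, n=50, trials=2000, z=1.96, seed=1):
    """Fraction of sample intervals that contain the true population mean.

    true_mean comes from the paper's example; sigma, n, trials and seed
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one sample of n values from the assumed population.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = statistics.mean(sample)
        # Half-width of the 95% interval around the sample average.
        half_width = z * statistics.stdev(sample) / n ** 0.5
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / trials

print(f"share of intervals containing the true value: {coverage():.3f}")
```

The printed share should land close to 0.95; the remainder are the samples that, like the 'average of 10' case above, would be accepted as true although they miss the real value.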
4. Generally, to reduce risk, the sample size, N, is made larger, rather than resampling the same population with the same size sample.

   Deciding upon the sample size, meaning the value of N, introduces a tension between the project's budget and/or schedule managers and the risk managers. Tension is another word for risk.

   - Budget managers want to limit the cost of gathering more data than is needed and thereby limit cost risk; in other words, avoid oversampling.
   - Risk managers want to limit the impact of not having enough data and thereby limit functional, feature, or performance risk.

   #### Sampling policy

   The risk plan customarily invokes a project management policy regarding the degree of risk that is acceptable:

   - 'Margin of error' is customarily accepted between +/- 3% and +/- 5%.
   - 'Confidence interval' is customarily a pre-selected percentage between 80% and 99%, most commonly 95% or 99%.

   The sampling protocol for a given project is designed by the risk manager to support these policy objectives.

   #### General examples

   Below are several population examples that are common in project situations. They fall into one of two population types: discrete proportions and continuous data.

   - Project managers and the project office often deal with proportions.
   - Project control account managers and team leaders often deal with 'continuous data'.

   1. Populations of categorical data characterized with proportions: Proportional data is sometimes called 'categorical data' or 'category data'; in Six Sigma, such data is called 'attribute data'. For example, a semiconductor wafer fits either into a category of 'defect free' or into another category of 'defective'. Proportion is often notated as 'p' for one category and '1-p' for the other. '1-p' is sometimes denoted 'q'. The proportion, p, is measurable and thus has statistical characteristics, but the underlying entity does not. In other words, we speak of the average proportion of defects, but not the average defect.
   2. Populations of continuous data: Continuous data is measured on a continuous number scale; it can be compared with other continuous data and can be manipulated with arithmetic operations.
5. The 'distance' between one point on the scale and another has a real meaning, not just a relative separation as on an ordinal scale. Measurable entity attributes are characterized with descriptive statistics, like average weight or average hours of experience, and with other statistics that can be calculated from data, like standard deviation and variance. Six Sigma refers to such populations as having 'continuous or variable data' attributes, referring to the idea that such attributes can be measured on a continuous scale.

   #### Examples of categorical data populations characterized with proportions

   | Population | Examples |
   | --- | --- |
   | Opinion | A proportion of users, operators, maintenance and support staff, or beneficiaries who have one opinion or another about a feature or function. |
   | Defects and pass/fail | A proportion of devices or objects that have a defect while others do not, or that possess an attribute that passes or fails some metric limit, like power consumed. Typically, pass/fail results are observed in a number of independent tests, inspections, or 'trials'. |
   | Classification and position | A proportion of devices or objects that are of a certain type, category, or classification. This situation comes up often in database projects where database records may or may not meet a specific type classification, but all manner of tangible objects also have type classifications, such as hard wood or soft wood, steel or stainless steel. A proportion of devices that are positioned above, between, or below some 'critical' boundary, like a quartile or percentile limit. |

   #### Examples of continuous measurement populations

   | Population | Examples |
   | --- | --- |
   | Objects with measurable attributes | Average age of a user group, average drying time of a coating, average time to code a design object, or average time to repair an object. Average difference between user groups, drying time, coding time, or repair time of one or another object. Average distance to [or between] object coordinates. |
6. Examples of continuous measurement populations (continued)

   | Population | Examples |
   | --- | --- |
   | Process events and opportunity | Process example: a process for which the arrival rate of events, like a trigger or a device failure, or the count of events in a unit of time or space is important. In a web commerce project, an example is the arrival rate of customers to the product ordering page. Opportunity example: an 'area' in which events can occur. In a chemical development project, an event could be the appearance (yes or no) of a certain molecule after some process activity; the measurable opportunity is the count of a certain molecule per cubic centimeter. |

   #### Project estimates

   Regardless of the nature of the population, the issues for the project manager are the same:

   - Effort: How much effort will sampling take? The LLN tells us the sample statistics will be 'good enough' if the sample is 'large enough'. For project managers the question is: how large is 'large enough'?
   - Impact: What is the impact of the risk to be mitigated? Confidence statistics and margins of error of the sample provide the ranges of the impact.

   #### Risk management and estimating rules of thumb

   | Topic | Rule of thumb |
   | --- | --- |
   | Population size | The actual size of the population is irrelevant so long as it is 'large' compared to the sample. Population size is not used in estimates, even if known, unless the population is 'small' when compared to the sample size. |
   | Sample size [count of values] | Sample size [count of values in the sample] is driven by risk tolerance for the possible error in the sample results. A larger count reduces error possibilities. There are formulas for sample size that take into account risk tolerance. |
   | Margin of error | The margin of error in the estimated statistic improves with increasing count of data values in the sample. |
   | Population parameter confidence | Confidence that the actual population parameter is within the sample data interval improves as the interval is made wider for a given number of sample values. Thus, for a sample of 30 values, the confidence interval for 99% confidence is wider than for 90% confidence. |
   | Common intervals | The most common confidence intervals are 80, 90, 95, and 99%. |

   #### Estimating proportional parameters

   Sample proportion notation:
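The rule of thumb that a 99% interval is wider than a 90% interval for the same 30 values can be illustrated with a short Python sketch. The multipliers are the rounded factors listed in the paper's appendix for continuous data; the sample σ of 10 is a hypothetical value chosen only for illustration.

```python
from math import sqrt

# Rounded multipliers as listed in the paper's appendix (not exact table values).
FACTORS = {80: 1.3, 90: 1.7, 95: 2.0, 99: 2.7}

def half_width(confidence, sample_sigma, n):
    """Half-width of the interval around the sample average."""
    return FACTORS[confidence] * sample_sigma / sqrt(n)

# sample_sigma=10.0 is an assumed value; n=30 is the sample size named in the text.
widths = {c: half_width(c, sample_sigma=10.0, n=30) for c in FACTORS}
for c, w in widths.items():
    print(f"{c}% confidence: sample average +/- {w:.2f}")
```

For a fixed sample, asking for more confidence can only be paid for with a wider interval; narrowing the interval at the same confidence requires a larger N.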
7. - One category is given a proportion notated 'p'.
   - '1-p' notates the sample proportion of the other category [sometimes '1-p' is denoted as 'q'].

   #### Project example with proportion

   Project description: Let's say that a project deliverable is a database that has over 10M data records to be loaded from a library [population]. Depending on the mix of categories of data records in the population, the scheduling manager will schedule more loading time if the records are mostly category-1, or less time if they are not.

   The project manager elects to sample the data record population to determine the proportion, p, of records that are category-1 so that the scheduling manager has information to guide project scheduling.

   The project risk management plan requires estimates to have 95% confidence for design parameters, and a margin of error of less than +/- 5% on sample data values.

   Sample design: With no a priori hypothesis of the expected proportion 'p', some iteration may be required. The risk manager refers to the chart given in the appendix entitled "Proportion 'p' vs +/- Margin of Error %", which is a plot of error percentage for a confidence of 95%. From that chart, the risk manager finds that maintaining less than 5% margin of error of 'p' with 95% confidence requires a sample size of at least 1,000, and perhaps as large as 10,000.

   Starting with N = 1,000, if the first sample returns a 'p' value that is 0.6 or greater, the margin of error is likely less than +/- 5%; no further sampling is required. Otherwise, a larger sample size is required.

   Sample analysis: Assume the sample returns a value of 'p' of 0.7 using a sample of N = 1,000. From the confidence interval equation for proportions given in the appendix, the 95% confidence interval for the estimated proportion is calculated to be 67% to 73%, centered on 70%.

   Risk management analysis: There is a 5% probability that the proportion 'p' is not within the confidence interval of 67% to 73%. There is not enough information to forecast whether the proportion 'p' is more likely less than 67% or greater than 73%.

   From the chart in the appendix for margin of error, the margin of error of the proportion value 0.7 is about +/- 4%, or +/- 0.028, from 0.67 to 0.73.

   The sample data supports the project risk tolerance policy objectives of 95% confidence and < +/- 5% margin of error.
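The sample analysis for the database example can be reproduced in a few lines of Python. This is a sketch that applies the appendix formula for proportions with the paper's rounded factor Z = 2; the function name `proportion_interval` is illustrative and not from the paper.

```python
from math import sqrt

def proportion_interval(p, n, z=2.0):
    """Return (low, high, relative margin of error) for a sample proportion.

    half-width = z * sqrt(p * (1 - p) / n); margin of error = half-width / p.
    z=2.0 is the paper's rounded factor for 95% confidence.
    """
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width, half_width / p

# Values from the project example: p = 0.7 observed in a sample of N = 1,000.
low, high, margin = proportion_interval(p=0.7, n=1000)
print(f"95% interval: {low:.3f} to {high:.3f}")
print(f"margin of error: +/- {margin:.1%}")
```

The result agrees with the 67% to 73% interval and the roughly +/- 4% margin of error read from the chart.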
8. #### Estimating continuous data parameters: project example with descriptive statistics

   Project description: Let's say that a project deliverable is an ejector seat for a military aircraft; the average weight of the pilot population needs to be known for the design.

   The project manager elects to sample the pilot population rather than weigh every pilot.

   The project risk management plan requires estimates to have 95% confidence for design parameters and +/- 3% margin of error for sample data statistics.

   Sample frame: From the chart in the appendix entitled "% Margin of Error v N, 95% Confidence", the risk manager finds that a sample of size 85 is required to meet the +/- 3% policy metric and simultaneously meet the 95% confidence interval metric. So, in this example, 85 pilots are weighed from a population frame of active duty military pilots, both men and women.

   Assume the sample average is found from the sample data to be 175 lbs [79.4 kg], and the Sample σ, calculated from the sample data by spreadsheet function, is 25 lbs [11.3 kg].

   Sample analysis: From the equation given in the appendix for continuous data, the 95% confidence interval for the estimated average weight of the pilot population is about +/- 5.4 lbs [+/- 2.4 kg], or from 169.6 to 180.4 lbs [76.9 to 81.8 kg].

   Risk management analysis: There is a 5% probability that the average pilot weight is not within the confidence interval.

   The sample average of 175 pounds is estimated to have a margin of error of about +/- 3% (5.4 / 175 ≈ 3.1%), or +/- 5.4 lbs [+/- 2.4 kg].
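The pilot-weight numbers can be checked the same way. The Python sketch below applies the appendix formula for continuous data with the paper's rounded factor of 2 for 95% confidence; the function name `mean_interval` is illustrative and not from the paper.

```python
from math import sqrt

def mean_interval(sample_average, sample_sigma, n, t=2.0):
    """Return (low, high, relative margin of error) for a sample average.

    half-width = t * sample_sigma / sqrt(n); t=2.0 is the paper's rounded
    factor for 95% confidence.
    """
    half_width = t * sample_sigma / sqrt(n)
    return (sample_average - half_width,
            sample_average + half_width,
            half_width / sample_average)

# Values from the project example: 85 pilots, average 175 lbs, Sample sigma 25 lbs.
low, high, margin = mean_interval(sample_average=175.0, sample_sigma=25.0, n=85)
print(f"95% interval: {low:.1f} to {high:.1f} lbs")
print(f"margin of error: +/- {margin:.1%}")
```

With these inputs the half-width is about 5.4 lbs, which is roughly 3% of the 175 lb sample average.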
9. #### Acknowledgement

   The author is indebted to Dr. Walter P. Bond, Associate Professor (retired) of Florida Institute of Technology, for suggestions and peer review.

   #### Appendix: Proportional category data

   Confidence interval, proportions

   The following equations define the confidence interval for varying confidence objectives, where √ is the symbol for 'square root' [sqrt]. The numerical factor is a so-called 'Z' number taken from the standard normal bell curve. Z = 1 corresponds to one standard deviation, σ. Z values typically range +/- 3 about the standard normal mean of '0'.

   $$\text{80\% Interval} = p \pm 1.3\sqrt{\frac{p(1-p)}{N}}$$

   $$\text{90\% Interval} = p \pm 1.7\sqrt{\frac{p(1-p)}{N}}$$

   $$\text{95\% Interval} = p \pm 2\sqrt{\frac{p(1-p)}{N}}$$

   $$\text{99\% Interval} = p \pm 2.7\sqrt{\frac{p(1-p)}{N}}$$

   The confidence objective, expressed as a %, is read as, for example, 80% confidence the real population parameter is within the interval, and 20% confidence the real population parameter is outside the interval.

   Margin of error, proportions

   The following chart is a plot of three different sample sizes, N, showing the margin of error as the proportion 'p' changes.

   [Chart: "Proportion 'p' vs +/- Margin of Error %", 95% confidence]

   This chart is based on the formula for margin of error given below:

   $$\pm\,\text{Margin of Error} = \frac{\tfrac{1}{2}\,\text{Interval width}}{p}, \qquad \tfrac{1}{2}\,\text{Interval width} = \pm Z\sqrt{\frac{p(1-p)}{N}}$$

   where Z = 2 for 95% confidence (more precisely, 1.96).
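The margin-of-error formula for proportions can also be solved for N, which gives a rough first estimate of the sample size before any chart is consulted. Rearranging margin = Z·√(p(1−p)/N) / p gives N = Z²(1−p) / (p·margin²). The Python sketch below is an algebraic rearrangement of the appendix formula, not a method stated in the paper; the function name and the trial values of 'p' are illustrative assumptions.

```python
from math import ceil

def sample_size_for_proportion(p, margin, z=2.0):
    """Smallest N for which z * sqrt(p * (1 - p) / N) / p <= margin.

    p is the expected proportion (a guess before sampling), margin is the
    target margin of error as a fraction of p, and z=2.0 is the paper's
    rounded factor for 95% confidence.
    """
    return ceil(z ** 2 * (1 - p) / (p * margin ** 2))

# Hypothetical guesses for p, each with the policy limit of +/- 5%.
for guess in (0.3, 0.7, 0.9):
    print(f"p = {guess}: N >= {sample_size_for_proportion(guess, 0.05)}")
```

Because 'p' appears in the denominator, small proportions demand much larger samples, which is why some iteration on N may be needed when no prior estimate of 'p' is available.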
10. #### Continuous data and descriptive statistics

    Confidence interval

    The following equations give approximations of the interval range. 'N' is the count of data values in the sample; √N is the square root of 'N'. The numerical factor in the numerator comes from a table of 't' values that are developed by statisticians for sampling analysis. The 't' value depends on the count of the sample points. The 't' value typically ranges +/- 3.

    $$\text{80\% Interval} = \text{Sample average} \pm \frac{1.3}{\sqrt{N}} \times \text{Sample } \sigma \quad \text{[narrowest interval]}$$

    $$\text{90\% Interval} = \text{Sample average} \pm \frac{1.7}{\sqrt{N}} \times \text{Sample } \sigma$$

    $$\text{95\% Interval} = \text{Sample average} \pm \frac{2}{\sqrt{N}} \times \text{Sample } \sigma$$

    $$\text{99\% Interval} = \text{Sample average} \pm \frac{2.7}{\sqrt{N}} \times \text{Sample } \sigma \quad \text{[widest interval]}$$

    Note: The Sample σ, or sample standard deviation, is calculated, usually by spreadsheet function, from the sample data.
11. #### Margin of error

    The margin of error is based on the following equation:

    $$\text{Margin of error} = \pm\frac{t \times \text{Sample } \sigma / \sqrt{N}}{\text{Sample average}}$$

    where 't' = 2 for a 95% confidence interval (more precisely, 1.96).

    The following is a plot for the margin of error as a function of the sample size, N.

    [Chart: "% Margin of Error v N, 95% Confidence"]
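Because N enters the equation through √N, the margin of error shrinks slowly: quadrupling the sample only halves the margin. The Python sketch below tabulates this using the pilot-weight figures from the project example (sample average 175 lbs, Sample σ 25 lbs); the particular sample sizes are illustrative assumptions and are not read from the paper's plot.

```python
from math import sqrt

def margin_of_error(sample_average, sample_sigma, n, t=2.0):
    """Relative margin of error: t * sample_sigma / sqrt(n) / sample_average."""
    return t * sample_sigma / sqrt(n) / sample_average

# Each sample size is four times the previous one (assumed values).
sizes = (25, 100, 400, 1600)
margins = {n: margin_of_error(175.0, 25.0, n) for n in sizes}
for n, m in margins.items():
    print(f"N = {n:>4}: +/- {m:.2%}")
```

This is the budget-versus-risk tension in numeric form: each halving of the margin of error costs four times as many observations.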