Data   What Type Of Data Do You Have V2.1

Data for Statistics - A discussion about Data Types not found in the CMMI

    Presentation Transcript

    • DATA
    • Data
      Data – Input for Analysis and Interpretation
      Data are generally collected as a basis for action
      You must always use some method of analysis to extract and interpret the information that lies in the data
      The type of data that has been collected will determine the type of statistics or analysis that can be performed
      Making sense of the data is a process in itself
      Always provide a “context” for data
      Data have no meaning apart from their context
      Data should always be presented in such a way that preserves the evidence in the data for all the predictions that might be made from these data
    • Data - 2
      Data should be completely and fully described
      Who collected the data?
      How were the data collected?
      When were the data collected?
      Where were the data collected?
      What do these values represent?
      If the data are computed values, how were the values computed from the raw inputs?
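      The questions above can be captured as a small record type so that no value travels without its context. This is only a sketch: the class name, field names, and sample values are hypothetical and are not part of the original slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MeasurementRecord:
    """One data value together with the context needed to interpret it."""
    value: float
    represents: str                   # what does this value represent?
    collected_by: str                 # who collected the data?
    method: str                       # how were the data collected?
    collected_on: date                # when were the data collected?
    location: str                     # where were the data collected?
    derivation: Optional[str] = None  # how it was computed from raw inputs, if computed

    def is_computed(self) -> bool:
        """Return True when the value was derived rather than observed directly."""
        return self.derivation is not None


# Hypothetical example values, for illustration only.
record = MeasurementRecord(
    value=4.2,
    represents="defects per thousand lines of code",
    collected_by="inspection team",
    method="formal code inspection",
    collected_on=date(2024, 1, 15),
    location="module A",
    derivation="defects found / (lines of code / 1000)",
)
print(record.is_computed())
```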
    • Data - 3
      Variation exists in all data and consists of both noise (random or common cause variation) and signal (nonrandom or special cause variation)
      Without formal and standardized approaches for analyzing data, you may have difficulty interpreting and using your measurement results
      When you interpret and act on measurement results, you are presuming that the measurements represent reality
    • Data - 4
      To use data safely, you must have simple and effective methods not only for detecting signals that are surrounded by noise, but also for recognizing and dealing with normal process variations when there are no signals present
      Drawing conclusions and predictions from data depends not only on using appropriate analytical methods and tools, but also on understanding the underlying nature of the data and the appropriateness of assumptions about the conditions and environments in which the data were obtained
    • Data Definitions
      Categorical vs. Quantitative Variables - Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical)
      Categorical - Categorical variables take on values that are names or labels. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.
      Quantitative - Quantitative variables are numerical. They represent a measurable quantity.
      For example, when we speak of the population of a city, we are talking about the number of people in the city - a measurable attribute of the city. Therefore, population would be a quantitative variable
    • Data Definitions - 2
      Discrete vs. Continuous Variables - Quantitative variables can be further classified as discrete or continuous.
      If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.
      Examples to clarify the difference between discrete and continuous variables.
      Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable; since a fire fighter's weight could take on any value between 150 and 250 pounds.
      Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be any number between 0 and plus infinity. We could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.
    • Attributes Data vs. Variables Data
    • Variables Data
      Variables data is measured and plotted on a continuous scale
      With variables data, an actual numeric estimate is derived for one or more characteristics of the population being sampled
    • Variables Data - 2
      In software, examples of variables data include:
      Effort expended - (Number of hours, days, weeks, years, etc., that have been expended by a workforce member on an identified topic)
      Years of experience - (Total number of years of experience per category)
      Memory utilization - (% of total memory available)
      CPU utilization - (% of CPU used at any given moment in time)
      Cost of rework - (Dollars and cents calculation of the rework based on the effort put forth by anyone involved in the finding and fixing of reported problems)
    • “Counts” Could Be Treated as Variables Data
      There are many situations where “counts” get used as measures of size:
      Total number of requirements
      Total lines of code
      Total bubbles in a data-flow diagram
      Customer sites
      Change requests received
      Total people assigned to a project
      When we count these things, we are counting all the entities in a population, not just the occurrence of entities with specific attributes
      These should always be treated as “variables” data even though they are instances of discrete counts
    • Attributes Data
      When working with attributes data, the focus is on learning about one or more specific non-numerical characteristics of the population being sampled
      When attributes data are used for direct comparisons, they must be based on consistent “areas of opportunity” if the comparisons are to be meaningful
      If the number of defects that are likely to be observed depends on the size (lines of code) of a module or component, all sizes must be nearly equal
      If the probabilities associated with defect discovery depend on the time spent on inspecting or testing, the elapsed time spent must be nearly equal
    • Attributes Data - 2
      In general, when the areas of opportunity for observing a specific event are not equal or nearly so, the chances of observing the event will differ across the observations
      Then we must normalize (convert to rates) by dividing each count by its area of opportunity before valid comparisons are made
      Conditions that make us willing to assume constant areas of opportunity seem to be less common in software environments
      Normalization is almost always needed for software!
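      As a minimal sketch of the normalization step, each count is divided by its own area of opportunity to give a rate. The function name and the sample figures are assumptions made for illustration; module size in thousands of lines of code stands in for the area of opportunity.

```python
def defect_rates(defect_counts, sizes_kloc):
    """Convert raw defect counts to rates (defects per KLOC).

    Each count is divided by its own area of opportunity, here the
    module size in thousands of lines of code.
    """
    if len(defect_counts) != len(sizes_kloc):
        raise ValueError("need one size per count")
    if any(size <= 0 for size in sizes_kloc):
        raise ValueError("areas of opportunity must be positive")
    return [count / size for count, size in zip(defect_counts, sizes_kloc)]


# Hypothetical modules: the first two have the same raw count,
# but very different sizes.
counts = [12, 12, 30]
sizes = [2.0, 6.0, 10.0]
rates = defect_rates(counts, sizes)
print(rates)  # [6.0, 2.0, 3.0]
```

      The first two modules look identical as raw counts, and the third looks worst; as rates, the first module stands out instead.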
    • Attributes Data - 3
      If the defects are being counted and the size of an item inspected influences the number of defects found, some measure of item size will also be needed to convert defect counts to relative rates that can be compared in meaningful ways (defects per lines of code)
      If the variations in the amount of time spent inspecting or testing can influence the number of defects found, these times should be clearly defined and measured as well
    • Attributes Data - 4
      One of the keys to making effective use of attributes data lies in preserving the ordering of each count in space and time
      Sequence information (the order in time or space in which the data is collected) is almost always needed to correctly interpret counts of attributes
      Make the counts specific – Make sure there is an operational definition (clear set of rules and procedures) for recognizing an attribute or entity if what gets counted is to be what the user of the data expects the data to be
    • Attributes Data - 5
      Attributes data is counted and plotted as discrete events:
      Shipping errors
      Percentage waste
      Number of defects found
      Number of defective items
      Number of source statements of a given type
      Number of lines of comments in a module of n lines
      Number of people with certain skills on a project
      Percentage of projects using formal inspections
      Number of priority-one customer complaints
      Percentage of non-conforming products in the output of an activity or a process
    • The Key to Classifying Data
      The key to classifying data as attributes data or variables data depends not so much on whether the data are discrete or continuous, but on how they are collected and used
      The total number of defects found is often used as a measure of the amount of rework or retesting to be performed
      Used this way, it is viewed as a measure of size and treated as variables data, even though it is a count based on attributes
      The method of analysis you choose for any data will depend on:
      The questions you are asking
      The data distribution model you have in mind
      The assumptions you are willing to make with respect to the nature of the data (Page 79)
    • Data Type Classifications
    • Distributional Models Relationship to Chart Types
      Each type of chart is related to a set of assumptions (a distributional model) that must hold for that type of chart to be valid.
      There are six types of charts for “attributes data”
      np chart
      p chart
      c chart
      u chart
      XmR for counts
      XmR for rates
    • Distributional Models Relationship to Chart Types - 2
      XmR charts have an advantage over np, p, c, and u charts in that they require fewer and less stringent assumptions
      They are easier to plot and use
      They have wide applicability
      Recommended by many quality-control professionals
      When assumptions of the distributional model are met, however, the more specialized np, p, c, and u charts can give better bounds for control limits and can offer advantages
    • Distributional Models Relationship to Chart Types - 3
      NP Chart – An np chart is used when the count data are binomially distributed and all samples have equal areas of opportunity
      These conditions occur in manufacturing settings – when there is 100% inspection of lots of size n (n is constant) and the number of defective units in each lot is recorded
      P Chart – a p chart is used when the data are binomially distributed but the areas of opportunity vary from sample to sample
      A p chart could be appropriate if the lot size n were to change from lot to lot
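      The np and p charts described above can be sketched with the standard three-sigma binomial limits. The function names and sample data are assumptions for illustration and do not come from the slides.

```python
import math


def np_chart_limits(defectives, n):
    """Center line and 3-sigma limits for an np chart (constant lot size n)."""
    p_bar = sum(defectives) / (n * len(defectives))
    center = n * p_bar
    sigma = math.sqrt(n * p_bar * (1 - p_bar))
    # A count of defectives cannot be negative or exceed the lot size.
    return center, max(0.0, center - 3 * sigma), min(float(n), center + 3 * sigma)


def p_chart_limits(defectives, lot_sizes):
    """Center line and per-lot 3-sigma limits for a p chart (lot size varies)."""
    p_bar = sum(defectives) / sum(lot_sizes)
    limits = []
    for n in lot_sizes:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        # A proportion must stay within [0, 1].
        limits.append((max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)))
    return p_bar, limits


# Hypothetical data: four lots of 100 units each for the np chart.
center, lcl, ucl = np_chart_limits([4, 6, 5, 5], 100)
print(center, lcl, ucl)

# Hypothetical data: lot sizes differ, so each lot gets its own limits.
p_bar, limits = p_chart_limits([5, 10, 5], [100, 200, 100])
print(p_bar, limits)
```

      With varying lot sizes, larger lots get tighter limits, which is why the p chart limits are computed per lot rather than once.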
    • Distributional Models Relationship to Chart Types - 4
      C Chart – a c chart is used when the count data are samples from a Poisson distribution and the samples all have equal-sized areas of opportunity
      U Chart – a u chart is used in place of a c chart when the count data are samples from a Poisson distribution and the areas of opportunity are not constant
      Defects per thousand lines of code is an example for software
      NP, P, C and U charts are the traditional control charts used with attributes data
      XmR Chart – Useful when little is known about the underlying distribution or when the justification for assuming a binomial or Poisson process is questionable
      Almost always a reasonable choice
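      As a rough sketch of how XmR (individuals and moving range) limits are computed, the example below uses the conventional scaling constants 2.660 and 3.268 for two-point moving ranges. The function name and the sample values are assumptions for illustration.

```python
def xmr_limits(values):
    """Compute natural process limits for an XmR chart.

    Uses the conventional constants for two-point moving ranges:
    2.660 for the individuals chart and 3.268 for the range chart.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_center": x_bar,
        "x_lcl": x_bar - 2.660 * mr_bar,
        "x_ucl": x_bar + 2.660 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.268 * mr_bar,
    }


# Hypothetical time-ordered measurements.
limits = xmr_limits([10, 12, 11, 13, 12])
for name, value in limits.items():
    print(f"{name}: {value:.3f}")
```

      The limits depend only on the values and their time order, which is why this chart needs fewer distributional assumptions than the np, p, c, and u charts.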
    • Distributional Models Relationship to Chart Types - 5
      More About U Charts – U charts seem to have the greatest prospects for use in software settings
      U charts require normalization (conversion to rates) when the areas of opportunity are not constant
      Poisson might be appropriate when counting the number of defects in modules during inspection or testing
      Defects per thousand lines of source code is an example of attributes data that is a candidate for u charts
      Although u charts may be appropriate for studying software defect densities in an operational environment, we are not aware of any empirical studies that have generally validated the use of Poisson models for nonoperational environments such as inspections
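      A u chart for defect densities can be sketched as follows, assuming the Poisson model holds. The three-sigma limits vary with each module's area of opportunity; the function name and the sample data are hypothetical.

```python
import math


def u_chart(defect_counts, areas):
    """Return the center line and, per sample, (rate, lcl, ucl, is_signal).

    `areas` are the areas of opportunity (for example, module size in KLOC).
    Limits are u_bar +/- 3 * sqrt(u_bar / area), so they differ per sample.
    """
    u_bar = sum(defect_counts) / sum(areas)
    rows = []
    for count, area in zip(defect_counts, areas):
        rate = count / area
        half_width = 3 * math.sqrt(u_bar / area)
        lcl = max(0.0, u_bar - half_width)  # a rate cannot be negative
        ucl = u_bar + half_width
        rows.append((rate, lcl, ucl, rate > ucl or rate < lcl))
    return u_bar, rows


# Hypothetical inspection results: defects found and module sizes in KLOC.
u_bar, rows = u_chart([4, 10, 34], [2.0, 2.5, 3.0])
print(f"center line: {u_bar:.2f} defects/KLOC")
for rate, lcl, ucl, is_signal in rows:
    print(f"rate={rate:.2f} lcl={lcl:.2f} ucl={ucl:.2f} signal={is_signal}")
```

      When every area of opportunity is the same, the limits become constant and the calculation reduces to the c chart case.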
    • Distributional Models Relationship to Chart Types - 6
      Defects per module or defects per test are unlikely candidates for u charts, c charts, or any other charts for that matter
      The ratios are not based on equal areas of opportunity – Can’t be normalized
      There is no reason to expect them to be constant across all modules or tests when the process is in statistical control
    • Distributional Models Relationship to Chart Types - 7
      If you are uncertain as to the model that applies, it can make sense to use more than one set of charts
      If you think you may have a Poisson situation but are not sure that all conditions for a Poisson process are present, then plotting both a u chart and the corresponding XmR charts should bracket the situation
      If both charts point to the same conclusions, you are unlikely to be led astray
      If the conclusions differ, then you should investigate your assumptions or the events
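      One way to carry out this cross-check is to flag signals under both models and list the samples where they disagree. The sketch below is illustrative only: the function names and data are assumptions, and three samples are far too few for a real chart.

```python
import math


def u_chart_signals(counts, areas):
    """Flag samples outside 3-sigma Poisson limits for rates."""
    u_bar = sum(counts) / sum(areas)
    flags = []
    for count, area in zip(counts, areas):
        rate = count / area
        half_width = 3 * math.sqrt(u_bar / area)
        flags.append(rate > u_bar + half_width or rate < max(0.0, u_bar - half_width))
    return flags


def xmr_signals(rates):
    """Flag samples outside the XmR natural process limits."""
    moving_ranges = [abs(b - a) for a, b in zip(rates, rates[1:])]
    center = sum(rates) / len(rates)
    half_width = 2.660 * sum(moving_ranges) / len(moving_ranges)
    return [r > center + half_width or r < center - half_width for r in rates]


def disagreements(counts, areas):
    """Return the indexes where the two charts reach different conclusions."""
    rates = [c / a for c, a in zip(counts, areas)]
    pairs = zip(u_chart_signals(counts, areas), xmr_signals(rates))
    return [i for i, (u_flag, x_flag) in enumerate(pairs) if u_flag != x_flag]


# Hypothetical defect counts and module sizes in KLOC.
print(disagreements([4, 10, 34], [2.0, 2.5, 3.0]))  # [2]
```

      An empty result means both charts point to the same conclusions; any index returned marks a sample whose assumptions or circumstances deserve investigation.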
    • Presenting Data
      While it is simple and easy to compare one number with another, such comparisons are limited and weak
      Limited because of the small amount of data used
      Weak because both of the numbers are subject to variation
      This makes it difficult to determine just how much of the difference between the values is due to variation in the numbers and how much is due to real changes in the process
    • Presenting Data - 2
      Graphs – there are two basic graphs that are the most helpful in providing the context for interpreting the current value
      Time series graph (Run Chart)
      Have months or years marked off on the horizontal axis and possible values marked off on the vertical axis
      As you move from left to right, there is a passage of time
      By visually comparing the current value with the plotted values for the preceding months you can quickly see if the current value is unusual or not
      Histogram (Tally Plot)
      An accumulation of the different values as they occur without trying to display the time order sequence
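      The difference between the two graphs can be sketched in plain text: the run chart keeps the time order, while the tally plot only accumulates how often each value occurred. The function names and the monthly values are hypothetical, and time runs top to bottom here rather than left to right.

```python
from collections import Counter


def run_chart_rows(labels, values):
    """One row per time period, in time order; the marker's offset is the value."""
    return [f"{label:<4}|{' ' * value}*" for label, value in zip(labels, values)]


def tally_plot(values):
    """One row per distinct value, ignoring time order; one mark per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} | {'x' * counts[value]}" for value in sorted(counts)]


# Hypothetical monthly counts.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
observed = [3, 5, 3, 4, 3]

print("\n".join(run_chart_rows(months, observed)))
print()
print("\n".join(tally_plot(observed)))
```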
    • Run Charts
      [Figure: run chart; title text as extracted: “Number of Required Changes to a Module as the Project Approaches Systems Test and Test”]
    • [Figure: annotated control chart; recoverable labels: “Number of Days”, “Product – Service Staff Hours”]
      Numerical data relating to the process, taken in time sequence
      Plotted points are either individual measurements or the means of small groups of measurements
      Center Line (CL) (Mean of data used to set up the chart)
      Upper and Lower Control Limits represent the natural variation in the process
      A point above or below the control lines suggests that the measurement has a special preventable or removable cause
      The chart is analyzed using standard rules to define the control status of the process
      The chart is used for continuous and time control of the process and prevention of causes
      Source: Adrian Burr & Mal Owen, Statistical Methods for Software Quality, 1996
    • Impacts of Poor Data Quality
      Inability to conduct hypothesis testing and predictive modeling
      Inability to manage the quality and performance of software or application development
      Ineffective process change instead of process improvement
      Ineffective and inefficient testing causing issues with time to market, field quality, and development costs
      Products that are costly to use within real-life usage profiles
    • References
      Brassard, Michael & Ritter, Diane, The Memory Jogger II – A Pocket Guide of Tools for Continuous Improvement & Effective Planning, GOAL/QPC, Salem, New Hampshire, 1994
      Florac, W.A. & Carleton, A.D., Measuring the Software Process, Addison-Wesley, 1999
      Six Sigma Academy, The Black Belt Memory Jogger – A Pocket Guide for Six Sigma Success, GOAL/QPC, Salem, New Hampshire, 2002
      Wheeler, Donald J., Understanding Variation: The Key to Managing Chaos, SPC Press, Knoxville, Tennessee, 2000