Data Collection Process and Integrity of Data Collection for Later Software Cost Estimation Calibrations
Gerrit Klaschke
What is covered?
- Data Collection
  - What is it?
  - Process
  - Best Practices
- Data Integrity
  - Checklist
  - Additional Tips
Data Collection Process
A data collection process should cover several important parts:
- How to ensure high-quality data (see Data Collection Integrity)
- How to collect data from the sources
- How to store the data for later retrieval (analyses and calibration)
The process itself must be refined to the point where the data received can be trusted with some confidence. Do not just take what someone wrote on the form at face value!
The reasons for collecting data, and the data that needs to be collected, are manifold. The Goal-Question-Metric (GQM) approach can help define which metrics you need to answer certain questions and reach goals. Goals can range from quality improvement to schedule reduction.
Data Collection Process
When to Collect Data
- When scoping a new project
- During development, for management and to identify issues and progress
- Post mortem, to improve the corporate history repository (database of completed projects)
- During maintenance, to continue improving
Data Collection Process
Suggested Central Repository Requirements
- Database must be extensible so new fields can be added easily
- Must be open, not a proprietary database
- Approach allows hosting on a standalone laptop for traveling users etc.
- Additional speed over browser-based versions
- Read information into Excel or access via ODBC
- Not limited to provided functionality, unlike many browser-based applications
Data Collection Process
Basic Flow
Individuals or organizations send their data on completed projects to the metrics analyst or the person responsible for collection and analysis.
All incoming data must be stored (which should include versioning in case updates come in from the same source) and then reviewed for integrity and completeness. If there are uncertainties, the metrics analyst has to clarify the points. Having an RDBMS makes tracking and updates very easy. If a normalization process is required, save both versions of the data.
Once a completed project passes QA, it becomes available in the database for retrieval. This includes retrieval for the purpose of ‘estimate by analogy’, further analysis (GQM or finding new correlations) and calibration of estimation models.
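The basic flow above (store every incoming version, review it, release it only after QA) could be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table name `submissions`, its columns and all function names are hypothetical and are not part of the original process description.

```python
import sqlite3

def open_repository(path=":memory:"):
    """Open the repository and create the (hypothetical) submissions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS submissions (
               project_id TEXT    NOT NULL,
               version    INTEGER NOT NULL,
               payload    TEXT    NOT NULL,            -- raw data as received
               qa_passed  INTEGER NOT NULL DEFAULT 0,  -- set to 1 after review
               PRIMARY KEY (project_id, version)
           )"""
    )
    return conn

def store_submission(conn, project_id, payload):
    """Store incoming data as a new version; earlier versions are kept."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM submissions WHERE project_id = ?",
        (project_id,),
    ).fetchone()
    version = row[0] + 1
    conn.execute(
        "INSERT INTO submissions (project_id, version, payload) VALUES (?, ?, ?)",
        (project_id, version, payload),
    )
    conn.commit()
    return version

def mark_qa_passed(conn, project_id, version):
    """Record that the metrics analyst has approved this version."""
    conn.execute(
        "UPDATE submissions SET qa_passed = 1 WHERE project_id = ? AND version = ?",
        (project_id, version),
    )
    conn.commit()

def retrievable_projects(conn):
    """Return the latest QA-approved version of each project."""
    return conn.execute(
        """SELECT s.project_id, s.version, s.payload
           FROM submissions AS s
           WHERE s.qa_passed = 1
             AND s.version = (SELECT MAX(version) FROM submissions
                              WHERE project_id = s.project_id AND qa_passed = 1)
           ORDER BY s.project_id"""
    ).fetchall()
```

Because updates are inserted as new versions rather than overwriting old rows, an unreviewed update never hides the last approved data set from analysis or calibration.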
Data Collection Process – Lessons Learned
- Identify business goals. Use GQM. Setting goals enables a metrics program to enhance business results, reduce cost by keeping the program well-defined and focused, and provide a basis for improving a business’s return on investment for IT.
- Clear definitions are essential, but people will not always follow them. Talk to people personally and interview them to capture data. Do not just take a form at face value. Doing this will improve the quality of the data, as the interviewer can ask clarifying questions.
- People don’t read instructions. They might provide ‘just a number’ off the top of their head. Some people might misrepresent the data on purpose to make themselves look better, or misinterpret it by mistake. Talk to them personally.
- Sensitive data: if people, departments or companies don’t want to share sensitive data or have concerns, try to sanitize the data.
Data Collection Process – Lessons Learned
- Cost of data collection: some will claim that data collection costs too much. Go through the list of benefits and back it up with data showing that estimation and project success increase when using a historical database and calibrated models. For instance, tell your manager: “Software metrics will help us reduce the number of faults reported in newly developed software by 25% without increasing project schedules. The resulting savings in support costs should drive a 150% ROI in the first year.”
- Cost of data collection 2: some developers will claim they are not paid to collect data. Determine their claimed CMM/CMMI rating. If it is 3 or higher, collecting data is required. Ask for the data in their format and offer to fill in the forms yourself.
Data Collection Process – Lessons Learned
- Use a good code counter. See the list of code counters on the QSM.com site. The ‘Understand’ code counter is also used quite often in companies.
- Be sure to distinguish auto-generated code from hand-written code. Auto-generated code does not have the same correlation to effort as hand-written code.
- Collect completed project actuals first: start by collecting data from completed projects, and THEN collect from projects that are still underway.
Data Collection Process – Lessons Learned
- Qualify the data quality: some data collected will be nonsensical. There are two approaches to handle this:
  - Eliminate this data altogether (not really recommended, as data is lost).
  - Include a qualifier on the data, rating it ‘a’ to ‘f’. The ISBSG database has a rating similar to this.
- Capture both total size and amount of reuse: reuse is an essential part of software size. Collecting only total size will skew the size/effort correlation.
- Don’t eliminate data points just because of the programming language: size can be converted from one language to another!
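As an illustration of the second approach above, the sketch below keeps every record in the repository and only filters by the quality qualifier at analysis time. The field name `rating`, the function name and the assumption that ‘a’ is the best rating and ‘f’ the worst are hypothetical choices made for this example.

```python
RATINGS = "abcdef"  # assumed ordering: 'a' is the best rating, 'f' the worst

def select_for_calibration(projects, worst_acceptable="b"):
    """Return the records rated at or above the cutoff; nothing is deleted."""
    cutoff = RATINGS.index(worst_acceptable)
    return [p for p in projects if RATINGS.index(p["rating"]) <= cutoff]
```

Records with poor ratings stay in the database, so they can be re-rated later if the data provider clarifies them.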
Data Collection Process – Lessons Learned
- Have a normalization process and keep the data in both raw and normalized forms. Data will be collected with varying phases, labor categories, size definitions etc. Keep the raw data, and have a standard, well-documented normalization process that is rigorously followed to normalize to a standard set of activities, phases etc.
- Have a structure for data storage: an Excel sheet can be used but will become unworkable as the database grows. Get the data into an open database as soon as possible.
- Offer data providers something in return: this could be a sanitized copy of the database, or at least a benchmark showing how their data fits with the rest of the database!
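One possible way to keep raw and normalized forms side by side is sketched below. The phase mapping, the record layout and the conversion of hours to person-months via an hours-per-month value are illustrative assumptions; a real normalization process would use the organization's own documented rules.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from phase names as reported to a standard phase set.
PHASE_MAP = {
    "coding": "construction",
    "implementation": "construction",
    "unit test": "construction",
    "design": "design",
    "detailed design": "design",
}

@dataclass
class ProjectRecord:
    raw_effort_hours: dict  # phase name as reported -> hours; never modified
    normalized_effort_pm: dict = field(default_factory=dict)

def normalize(record, hours_per_month):
    """Fill normalized_effort_pm (person-months per standard phase)."""
    totals = {}
    for phase, hours in record.raw_effort_hours.items():
        key = phase.strip().lower()
        if key not in PHASE_MAP:
            # Fail loudly instead of guessing: the rule set must be extended.
            raise ValueError(f"no normalization rule for phase {phase!r}")
        standard = PHASE_MAP[key]
        totals[standard] = totals.get(standard, 0.0) + hours / hours_per_month
    record.normalized_effort_pm = totals
    return record
```

For example, `normalize(ProjectRecord({"Coding": 304, "Design": 152}), hours_per_month=152)` yields 2.0 person-months of construction and 1.0 of design, while the reported hours remain available unchanged.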
Data Integrity Good quality data is paramount to ensure good calibration results.
Data Integrity - Checklist
- Review the goal of the data collection: What is the data being used for? E.g. project type calibration, later use for estimation by analogy etc. This drives the data being collected.
- Ensure the integrity of the data collection process: Have the groups providing data been trained with regard to the required data?
- Definitions: Are different projects providing data using the same data definitions?
Data Integrity - Checklist
- Approval of Inputs: Has at least one designated individual approved the inputs for each project?
- Missing Data: Has any missing data been identified?
- Estimates/Actuals: Are estimates of data items used in place of missing actual data?
- Rationale: Provide written rationale for any estimates used in the calibration.
Data Integrity - Checklist
- Sensitivity Analysis: If estimates are used in lieu of actuals, has a sensitivity analysis been done to evaluate the impact on the calibration of varying assumptions with respect to the estimates?
- Extra Data: Has any extra data or different definitions been used?
- Changes: Describe any changes made and the rationale for them.
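A sensitivity analysis of the kind asked for above can be as simple as recalibrating while varying the estimated value. The sketch below assumes, purely for illustration, a model of the form effort = A * size^B with a fixed exponent B, where the constant A is calibrated as a geometric mean; the model form, the field names and the ±20% variations are assumptions, not something prescribed by this checklist.

```python
import math

def calibrate_constant(projects, exponent=1.0):
    """Fit A in effort = A * size**exponent as a geometric mean over projects."""
    logs = [
        math.log(p["effort"]) - exponent * math.log(p["size"]) for p in projects
    ]
    return math.exp(sum(logs) / len(logs))

def sensitivity(projects, index, variations=(-0.2, 0.0, 0.2), exponent=1.0):
    """Recalibrate while varying the estimated effort of projects[index].

    Returns a dict mapping each relative variation to the calibrated constant.
    The input list is not modified.
    """
    results = {}
    for v in variations:
        varied = [dict(p) for p in projects]  # copy so the inputs stay intact
        varied[index]["effort"] = projects[index]["effort"] * (1.0 + v)
        results[v] = calibrate_constant(varied, exponent)
    return results
```

If the calibrated constant moves only slightly across the variations, the estimate is probably safe to use; if it moves a lot, that should be recorded in the written rationale.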
Data Integrity - Checklist
- Additional Data: Has any additional data been collected that can be used for later purposes? Identify the extra data and how it might be used. Examples include effort and schedule portions for detailed phases and activities.
- Size Conversion: Have all size measures been converted to eSLOC or another base unit?
Data Integrity - Checklist
- Counting Conventions: What SLOC counting conventions were followed (logical SLOC, physical SLOC etc.)? If SLOC is not used, what definitions were followed (such as IFPUG 4.2 standard, use cases 2.0)?
- Reuse: Are all reuse parameters provided for reused, modified and COTS software portions? Has all reuse and modification been accounted for and converted into equivalent SLOC?
Data Integrity - Checklist
- Reused/Modified: Does the total equivalent size include all new software and the equivalent sizes of reused and modified software?
- Evolution: Has requirements evolution been reported?
- Input Ranges: Make sure that there are no ranges in the volume input, as that would indicate previously estimated values.
- Factors: Have the environment and scaling factors been updated?
- Hours per Month: Has the correct hours-per-month (HPM) value been applied?
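The Reused/Modified check above amounts to a simple sum, sketched below. The two weighting factors are hypothetical placeholders for whatever reuse parameters the estimation model in use defines; this is not a specific model's reuse formula.

```python
def equivalent_size(new_sloc, reused_sloc, modified_sloc,
                    reuse_factor, modification_factor):
    """Total equivalent size = new + weighted reused + weighted modified.

    reuse_factor and modification_factor are assumed to be fractions in
    [0, 1] expressing how much a reused or modified line costs relative
    to a newly written line.
    """
    for name, value in (("reuse_factor", reuse_factor),
                        ("modification_factor", modification_factor)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return (new_sloc
            + reuse_factor * reused_sloc
            + modification_factor * modified_sloc)
```

With 10,000 new lines, 20,000 reused lines at a factor of 0.25 and 5,000 modified lines at a factor of 0.5, the total equivalent size is 10,000 + 5,000 + 2,500 = 17,500 eSLOC, well below the 35,000 lines of total size.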
Data Integrity – Additional Tips
- Actual Phase Information: all activities may NOT be included, e.g. system concept and integration are excluded.
- Actual Labor Information: all activities may NOT be included, e.g. configuration and quality assurance are excluded.
- Was the schedule ‘stop and start’?
- Resources: were there hard-hitting resource constraints?
- Volatility: did requirements undergo extraordinary evolution?
- Manager’s objectives: was the project to complete in ‘minimum time’ or at ‘least cost’?
- Effort: are effort figures actually derived from cost figures?
- Always run sanity checks on data, e.g. one million lines of code cannot be developed in 3 months.
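The last tip can be automated with a few plausibility checks on each incoming record. The sketch below is only an illustration: the function name and both threshold values are hypothetical and would need to be set from an organization's own historical data.

```python
def sanity_warnings(size_sloc, effort_pm, duration_months,
                    max_sloc_per_pm=1000.0,
                    max_sloc_per_calendar_month=50000.0):
    """Return a list of warnings for implausible size/effort/schedule values.

    Both thresholds are placeholder assumptions, not industry constants.
    """
    warnings = []
    if size_sloc <= 0 or effort_pm <= 0 or duration_months <= 0:
        warnings.append("size, effort and duration must all be positive")
        return warnings
    if size_sloc / effort_pm > max_sloc_per_pm:
        warnings.append("implausibly high productivity (SLOC per person-month)")
    if size_sloc / duration_months > max_sloc_per_calendar_month:
        warnings.append("implausibly high delivery rate (SLOC per calendar month)")
    return warnings
```

A record claiming one million lines of code in 3 months would trip the delivery-rate check under these placeholder thresholds and should go back to the data provider for clarification rather than straight into the calibration set.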