CARMA Internet Research Module: Detecting Bad Data

  1. Detecting Bad Data
     CARMA Research Module
     Jeff Stanton
  2. Sources of Data Problems in Online Studies
     May 18-20, 2006
     Internet Data Collection Methods (Day 2-2)
     • Technical errors:
       • Programming errors: Not common, but damaging when they occur
       • Server errors: Can halt the collection of data
       • Transmission errors: Uncommon and usually isolated to one record or field
     • Response fraud:
       • Inadvertent multiple response and malicious multiple response
       • Missing data
       • Intentionally malicious patterns of response leading to outliers or self-contradictory data
  3. Response Fraud
     • Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
     • Participant incentives introduce mixed motives: respondents need to complete the instrument, but not to any particular level of quality
     • Minimal frauds: skipping questions, not thinking through the answers
     • Maximal frauds: a robot that answers randomly
  4. Duplicate Detection
     • Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
     • Create a new variable that contains this “checksum” value for each row/case
     • Sort the dataset on the checksum
     • Create a lag difference variable that subtracts each row's checksum from that of its neighboring row
     • Sort on the lag variable and investigate all cases of zero or small differences
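The duplicate-detection steps above can be sketched in Python. This is a minimal illustration, not the SPSS code the slides refer to: the function names `checksum` and `find_near_duplicates`, the `tolerance` parameter, and the sample `responses` are all hypothetical.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint a row: sum of its numeric columns times their sample SD."""
    return sum(row) * stdev(row)

def find_near_duplicates(rows, tolerance=1e-9):
    """Return index pairs of rows whose checksums are adjacent after sorting
    and differ by no more than `tolerance` (an assumed cut-off)."""
    sums = [checksum(r) for r in rows]
    # Sort row indices by checksum so identical rows become neighbours.
    order = sorted(range(len(rows)), key=lambda i: sums[i])
    suspects = []
    for prev, curr in zip(order, order[1:]):
        lag_diff = sums[curr] - sums[prev]  # the "lag difference" variable
        if abs(lag_diff) <= tolerance:
            suspects.append((min(prev, curr), max(prev, curr)))
    return suspects

# Hypothetical data: one row per respondent, one column per item.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3],
    [1, 2, 3, 4, 5],  # exact copy of row 0
    [2, 2, 2, 3, 1],
]
print(find_near_duplicates(responses))
```

A zero difference only marks candidates for inspection. Rows that contain the same values in a different order share a checksum, and every row with identical answers throughout has a checksum of zero, so flagged pairs still need to be compared against the original data.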
  5. Bogus Response Detection
     • Calculate common univariate statistics using the complete row of responses for each subject
     • Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
     • Sort the cases by the mean value
     • Look for extreme outliers on the high and low ends
     • Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
     • Look for anomalies and trace them back to the original data for that subject
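As a rough sketch of the per-subject summaries described above, the following Python computes the six statistics for each row and sorts the cases. The subject IDs and data are made up, and the skewness and kurtosis use simple population-moment formulas, which is an assumption; statistical packages often apply small-sample corrections and will give slightly different values.

```python
from statistics import mean, pstdev

def row_summary(row):
    """Univariate summaries of one subject's complete row of responses."""
    m = mean(row)
    sd = pstdev(row)  # population SD, used in the moment formulas below
    if sd == 0:
        # Identical answers throughout: skew and kurtosis are undefined.
        skew = kurt = float("nan")
    else:
        n = len(row)
        skew = sum((x - m) ** 3 for x in row) / (n * sd ** 3)
        kurt = sum((x - m) ** 4 for x in row) / (n * sd ** 4) - 3  # excess kurtosis
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical responses on a 1-5 scale.
responses = {
    "s01": [3, 4, 2, 5, 3, 4],
    "s02": [5, 5, 5, 5, 5, 5],  # same option over and over
    "s03": [1, 5, 1, 5, 1, 5],  # alternating extremes
    "s04": [2, 3, 3, 4, 2, 3],
}
summaries = {sid: row_summary(row) for sid, row in responses.items()}

# Sort the cases on each summary and inspect both ends of the ordering.
by_mean = sorted(summaries, key=lambda sid: summaries[sid]["mean"])
by_sd = sorted(summaries, key=lambda sid: summaries[sid]["sd"])
print("lowest to highest mean:", by_mean)
print("lowest to highest sd:  ", by_sd)
```

In this toy data the constant responder sits at the bottom of the SD ordering and the top of the mean ordering, and the alternating responder sits at the top of the SD ordering, which is the kind of anomaly worth tracing back to the raw record.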
  6. Multivariate Outlier Detection
     • Use Mahalanobis distance to detect outliers
       • Regress an arbitrary dependent variable on a set of related items; the distance is computed from the predictors only, so the choice of dependent variable does not matter
       • Sort by Mahalanobis distance: larger distances are suggestive of outliers
     • Use autocorrelation to detect unusual data patterns
       • Flip the data: cases become variables and variables become cases
       • Run an autocorrelation function
       • Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 at one or more lags)
     • I have provided example SPSS code in the utilities area of the LMS for each of these tests
  7. Mahalanobis
  8. Plot, Sort, and Examine
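The Mahalanobis step can also be computed directly rather than through a regression procedure. The sketch below is a plain-Python illustration under stated assumptions: the helper names (`covariance_matrix`, `solve`, `mahalanobis_sq`) and the two-item data are hypothetical, the covariance matrix is assumed to be non-singular, and the function returns squared distances.

```python
from statistics import mean

def covariance_matrix(rows):
    """Column means and sample covariance matrix of a list of equal-length rows."""
    n, k = len(rows), len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(k)]
    cov = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
            for b in range(k)] for a in range(k)]
    return means, cov

def solve(matrix, vector):
    """Solve matrix * y = vector by Gaussian elimination with partial pivoting."""
    k = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, k))
        y[i] = (aug[i][k] - tail) / aug[i][i]
    return y

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every row from the centroid."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [x - m for x, m in zip(r, means)]
        # d' S^-1 d, computed as d . y where S y = d
        distances.append(sum(di * yi for di, yi in zip(d, solve(cov, d))))
    return distances

# Hypothetical data: two related items that mostly rise together.
rows = [[1, 2], [2, 1], [2, 3], [3, 3], [4, 4], [4, 5], [5, 4], [1, 5]]
d2 = mahalanobis_sq(rows)

# Sort cases from largest to smallest distance and examine the top of the list.
ranked = sorted(range(len(rows)), key=lambda i: d2[i], reverse=True)
for i in ranked[:3]:
    print(i, rows[i], round(d2[i], 2))
```

The last case, `[1, 5]`, has unremarkable values on each item taken alone but runs against the joint pattern, so it ranks first. That is the situation where a multivariate check adds something over the univariate summaries.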
  9. An ACF Indicating No Pattern
  10. An ACF with a Suspicious Pattern
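To make the autocorrelation check concrete, here is a small Python sketch that treats one respondent's answers, in item order, as a series and reports the lags whose autocorrelation exceeds .5. The function names, the `threshold` default, and the sample series are illustrative assumptions; the slides themselves rely on ACF graphs produced in SPSS.

```python
from statistics import mean

def acf(series, max_lag):
    """Sample autocorrelations of `series` at lags 1..max_lag."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    result = []
    for lag in range(1, max_lag + 1):
        num = sum((series[t] - m) * (series[t + lag] - m)
                  for t in range(len(series) - lag))
        result.append(num / denom)
    return result

def suspicious_lags(series, max_lag, threshold=0.5):
    """Lags whose autocorrelation exceeds `threshold`."""
    return [lag for lag, r in enumerate(acf(series, max_lag), start=1)
            if r > threshold]

# Hypothetical respondent who cycles 1-2-3-4-5 across 20 items.
patterned = [1, 2, 3, 4, 5] * 4
print(suspicious_lags(patterned, max_lag=6))
```

For the cycling respondent the only lag flagged is 5, the length of the repeated cycle. A series with zero variance (the same answer to every item) makes the denominator zero, so such rows should be screened out with the univariate summaries first.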
  11. Common Missing Data Mitigation Techniques
     • Item imputation
       • For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
     • Mean substitution
       • Suppresses variability
     • Time series imputation
       • Mean of neighboring points; suppresses spikes
     • Regression imputation
       • Works well for highly intercorrelated variables
     • Full information maximum likelihood imputation
       • Available in some SEM programs
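The three simplest techniques on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: missing values are represented as `None`, the function names are hypothetical, and `max_missing=1` is an arbitrary stand-in for "a small subset" of items.

```python
from statistics import mean

def scale_score(items, max_missing=1):
    """Composite scale as the average of the answered items.
    Returns None when more than `max_missing` items are missing."""
    present = [x for x in items if x is not None]
    if len(items) - len(present) > max_missing:
        return None
    return mean(present)

def mean_substitute(column):
    """Replace missing values in a variable with the mean of the observed values."""
    m = mean(x for x in column if x is not None)
    return [m if x is None else x for x in column]

def neighbor_mean(series):
    """Replace an isolated interior gap with the mean of its two neighbors."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        if (series[i] is None and series[i - 1] is not None
                and series[i + 1] is not None):
            filled[i] = (series[i - 1] + series[i + 1]) / 2
    return filled

print(scale_score([4, None, 5, 3]))     # one item missing: average the other three
print(scale_score([4, None, None, 3]))  # too many missing: no score
print(mean_substitute([2, None, 4]))
print(neighbor_mean([1, None, 3, 9, None, 1]))
```

Mean substitution fills every gap with the same value, which is why it pulls the variance of the variable down, and the neighbor mean smooths over any spike that would have fallen on a missing point.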
  12. Excel Tips
     • Your friend the “fill” function
     • The power of “Paste Special”
     • Sorting: Click on Data/Sort
  13. Excel Statistical Formulas
     • =find(<find text>, <within text>, <start>)
       • Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>
       • Example: =find("=", "fish=head", 1)
     • =Len(<string>)
       • Returns the number of characters in a string
       • Example: =Len("Ouch")
     • =Right(<string>,<length>)
       • Returns the rightmost <length> characters in the string
       • Example: =Right("fishhead",4)
       • =Left(<string>,<length>) works similarly
     • =average(value, value…)
       • Gives the arithmetic mean of a collection of cells and/or numeric values
     • =stdev(value, value…) // stdevp(value, value…)
       • Gives the sample/population standard deviation of a collection of cells and/or numeric values
     • =sum(value, value…)
       • Gives the sum of a collection of cells and/or numeric values
     • =correl(vector1, vector2)
       • Gives the Pearson correlation between two vectors
     • =if(<test>,<value if true>,<value if false>)
       • Makes a logical test and returns a different value depending on whether the test is true or false
       • Example: =if(1=1, "Yes!", "No…")
  14. Summary of Bad Data Problems
     • Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
     • Unmotivated responding: Participant uses the same option over and over again
     • Malicious patterns: Participant enters some unusually regular pattern of responses
     • There are at least five errors of these kinds in the exercise dataset (see below)
