CARMA Internet Research Module: Detecting Bad Data
Presentation Transcript
1.
Detecting Bad Data
CARMA Research Module
Jeff Stanton
2.
May 18-20, 2006 Internet Data Collection Methods (Day 2-2)
Sources of Data Problems in Online Studies
Technical errors:
  Programming errors: not common, but damaging when they occur
  Server errors: can halt the collection of data
  Transmission errors: uncommon and usually isolated to one record or field
Response fraud:
  Inadvertent multiple response and malicious multiple response
  Missing data
  Intentionally malicious patterns of response leading to outliers or self-contradictory data
3.
Response Fraud
Deindividuation: anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
Participant incentives introduce mixed motives: respondents need to complete the instrument, but not to any particular level of quality
Minimal frauds: skipping questions, not thinking through the answers
Maximal frauds: a robot that answers randomly
4.
Duplicate Detection
Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
Create a new variable that contains this “checksum” value for each row/case
Sort the dataset on the checksum
Create a lag difference variable that subtracts each row's checksum from the checksum of its neighboring row
Sort on the lag variable and investigate all cases of zero or small differences
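The checksum-and-lag procedure above can be sketched in a few lines of Python. This is only an illustration, not the SPSS code referred to elsewhere in the module: the function names, the `tolerance` parameter, and the toy `responses` data are all assumptions made for the example.

```python
from statistics import stdev

def row_checksum(row):
    """Fingerprint one case: sum of its numeric responses times their sample SD."""
    return sum(row) * stdev(row)

def find_near_duplicates(rows, tolerance=1e-9):
    """Return pairs of row indices whose checksums differ by at most `tolerance`."""
    # Pair each checksum with its original row index, then sort on the checksum.
    keyed = sorted((row_checksum(r), i) for i, r in enumerate(rows))
    suspects = []
    # Lag difference: compare each checksum with its neighbor in sorted order.
    for (prev_sum, prev_idx), (cur_sum, cur_idx) in zip(keyed, keyed[1:]):
        if cur_sum - prev_sum <= tolerance:
            suspects.append((prev_idx, cur_idx))
    return suspects

# Hypothetical data: four cases answering five Likert-type items.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3],
    [1, 2, 3, 4, 5],  # exact duplicate of row 0
    [2, 2, 2, 3, 1],
]
print(find_near_duplicates(responses))  # → [(0, 2)]
```

A matching checksum is only a prompt to inspect the rows: the sum and SD ignore item order, so two different response patterns with the same values will collide, and every constant-response row gets a checksum of zero.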
5.
Bogus Response Detection
Calculate common univariate statistics using the complete row of responses for each subject
Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
Sort the cases by the mean value
Look for extreme outliers on the high and low ends
Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
Look for anomalies and trace them back to the original data for that subject
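As a rough sketch of the row-wise summaries above, the following Python computes the six statistics for each case and sorts the cases on any one of them. The names `row_summary` and `rank_cases` and the toy data are assumptions for illustration. Skewness and kurtosis here use simple population moments, so the values may differ slightly from the adjusted estimates a statistics package reports.

```python
from statistics import mean, stdev

def row_summary(row):
    """Univariate summaries computed across one subject's responses."""
    n = len(row)
    m = mean(row)
    # Central moments (population form) for skewness and excess kurtosis.
    m2 = sum((x - m) ** 2 for x in row) / n
    m3 = sum((x - m) ** 3 for x in row) / n
    m4 = sum((x - m) ** 4 for x in row) / n
    return {
        "mean": m,
        "sd": stdev(row),
        # Both are undefined for a constant row; report 0.0 in that case.
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurt": m4 / m2 ** 2 - 3 if m2 > 0 else 0.0,
        "max": max(row),
        "min": min(row),
    }

def rank_cases(rows, key):
    """Return (row index, summary) pairs sorted ascending on one summary."""
    summaries = [(i, row_summary(r)) for i, r in enumerate(rows)]
    return sorted(summaries, key=lambda item: item[1][key])

# Hypothetical data: three subjects answering six items.
responses = [
    [3, 4, 2, 5, 3, 4],
    [5, 5, 5, 5, 5, 5],  # same option every time
    [1, 5, 1, 5, 1, 5],
]
lowest_sd_case = rank_cases(responses, "sd")[0][0]
print(lowest_sd_case)  # → 1
```

The cases at either end of each sorted list are the ones to trace back to the raw data.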
6.
Multivariate Outlier Detection
Use Mahalanobis distance to detect outliers
  Regress an arbitrary dependent variable on a set of related items, so that the items act as the predictors
  Sort by Mahalanobis distance: larger distances are suggestive of outliers
Use autocorrelation to detect unusual data patterns
  Flip the data: cases become variables and variables become cases
  Run an autocorrelation function
  Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)
I have provided example SPSS code in the utilities area of the LMS for each of these tests
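For readers working outside SPSS, the Mahalanobis step can be approximated directly from the item covariance matrix rather than through a regression. The sketch below is a minimal pure-Python version under that assumption: it computes the squared distance of each case from the centroid of the items. The function names and the six-case dataset are hypothetical, and a real analysis would need many more cases than items.

```python
from statistics import mean

def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for a list of cases."""
    n, k = len(rows), len(rows[0])
    means = [mean([r[j] for r in rows]) for j in range(k)]
    cov = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
            for b in range(k)] for a in range(k)]
    return means, cov

def solve(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def mahalanobis_squared(rows):
    """Squared Mahalanobis distance of every case from the item centroid."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [r[j] - means[j] for j in range(len(means))]
        w = solve(cov, d)  # w = inverse(cov) * d, without forming the inverse
        distances.append(sum(di * wi for di, wi in zip(d, w)))
    return distances

# Hypothetical data: two positively related items; the last case breaks the pattern.
cases = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5], [1, 5]]
d2 = mahalanobis_squared(cases)
most_extreme = max(range(len(cases)), key=lambda i: d2[i])
print(most_extreme)  # → 5
```

Sorting the cases on `d2` in descending order puts the candidates for inspection at the top of the list.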
7.
Mahalanobis
8.
Plot, Sort, and Examine
9.
An ACF Indicating No Pattern
10.
An ACF with a Suspicious Pattern
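The autocorrelation check can also be run numerically rather than by eye. The sketch below treats one subject's responses, in item order, as a series and reports the lags whose autocorrelation exceeds .5. The function names, the `max_lag` default, and the example series are assumptions for illustration. Flagging on the absolute value is also an assumption, made so that strictly alternating patterns (which produce large negative autocorrelations) are caught as well.

```python
def acf(series, max_lag):
    """Sample autocorrelations of `series` at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        # A constant series has no variance, so no autocorrelation is defined.
        return [0.0] * max_lag
    return [
        sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]

def suspicious_lags(series, max_lag=5, threshold=0.5):
    """Lags whose autocorrelation exceeds the threshold in absolute value."""
    return [k for k, r in enumerate(acf(series, max_lag), start=1)
            if abs(r) > threshold]

# Hypothetical subject who cycles 1-2-3-4-5 across twenty items.
patterned = [1, 2, 3, 4, 5] * 4
print(suspicious_lags(patterned))  # → [5]
```

A constant row yields no flagged lags here, so this check complements, rather than replaces, the row-wise standard deviation check.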
11.
Common Missing Data Mitigation Techniques
Item imputation
  For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
Mean substitution
  Suppresses variability
Time series imputation
  Mean of neighboring points; suppresses spikes
Regression imputation
  Works well for highly intercorrelated variables
Full information maximum likelihood imputation
  Available in some SEM programs
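The three simplest techniques in this list can be sketched as follows, using `None` to mark a missing value. The function names and the `max_missing` cutoff (standing in for "a small subset") are assumptions for illustration, not values taken from the module.

```python
from statistics import mean

def scale_score(items, max_missing=1):
    """Average the items that are present, tolerating a few missing ones."""
    present = [x for x in items if x is not None]
    if not present or len(items) - len(present) > max_missing:
        return None  # too much is missing to trust the composite
    return mean(present)

def mean_substitute(column):
    """Replace each missing value with the mean of the observed values."""
    fill = mean([x for x in column if x is not None])
    return [fill if x is None else x for x in column]

def neighbor_mean_impute(series):
    """Replace an isolated missing point with the mean of its two neighbors."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled

print(scale_score([4, None, 5, 3]))              # → 4
print(mean_substitute([2, None, 4]))             # → [2, 3, 4]
print(neighbor_mean_impute([1, None, 3, None]))  # → [1, 2.0, 3, None]
```

The last call shows a limit of the neighbor approach: a missing value at either end of the series has only one neighbor and is left as missing.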
12.
Excel Tips
Your friend the “fill” function
The power of “Paste Special”
Sorting: click on Data/Sort
13.
Excel Statistical Formulas
=find(<find text>, <within text>, <start>)
  Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>
  Example: =find("=", "fish=head", 1)
=Len(<string>)
  Returns the number of characters in a string
  Example: =Len("Ouch")
=Right(<string>,<length>)
  Returns the rightmost <length> characters in the string
  Example: =Right("fishhead",4)
  =Left(<string>,<length>) works similarly
=average(value, value…)
  Gives the arithmetic mean of a collection of cells and/or numeric values
=stdev(value, value…) or =stdevp(value, value…)
  Gives the sample (stdev) or population (stdevp) standard deviation of a collection of cells and/or numeric values
=sum(value, value…)
  Gives the sum of a collection of cells and/or numeric values
=correl(vector1, vector2)
  Gives the Pearson correlation between two vectors
=if(<test>,<value if true>,<value if false>)
  Makes a logical test and returns a different value depending on whether the test is true or false
  Example: =if(1=1, "Yes!", "No…")
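For comparison with the other sketches in this transcript, here are rough Python counterparts to the formulas above. The helper names `excel_find` and `correl` and the sample `values` are assumptions for illustration. One difference to keep in mind is that Excel counts string positions from 1, while Python counts from 0.

```python
from statistics import mean, stdev, pstdev

def excel_find(find_text, within_text, start=1):
    """1-based position of find_text in within_text, searching from `start`."""
    index = within_text.find(find_text, start - 1)
    if index == -1:
        raise ValueError("text not found")  # Excel shows an error value here
    return index + 1

def correl(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

values = [2, 4, 4, 4, 5, 5, 7, 9]

print(excel_find("=", "fish=head", 1))   # → 5
print(len("Ouch"))                       # → 4
print("fishhead"[-4:])                   # → head   (like =Right)
print("fishhead"[:4])                    # → fish   (like =Left)
print(mean(values))                      # → 5      (like =average)
print(pstdev(values))                    # → 2.0    (like =stdevp)
print(sum(values))                       # → 40     (like =sum)
print("Yes!" if 1 == 1 else "No...")     # → Yes!   (like =if)

sample_sd = stdev(values)                # like =stdev (divides by n - 1)
r = correl([1, 2, 3, 4], [2, 4, 6, 9])   # like =correl
```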
14.
Summary of Bad Data Problems
Multiple submissions: the same participant clicks on Submit, then Back, then Submit, then Back…
Unmotivated responding: the participant uses the same option over and over again
Malicious patterns: the participant enters some unusually regular pattern of responses
There are at least five errors of these kinds in the exercise dataset (see below)