CARMA Internet Research Module: Detecting Bad Data (Presentation Transcript)

  • 1. Detecting Bad Data
    CARMA Research Module
    Jeff Stanton
    May 18-20, 2006
  • 2. Sources of Data Problems in Online Studies
    Technical errors:
    Programming errors: Not common, but damaging when they occur
    Server errors: Can halt the collection of data
    Transmission errors: Uncommon and usually isolated to one record or field
    Response fraud:
    Inadvertent multiple response and malicious multiple response
    Missing data
    Intentionally malicious patterns of response leading to outliers or self-contradictory data
  • 3. Response Fraud
    Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
    Participant incentives introduce mixed motives: an incentive to complete the instrument, but not to complete it to any particular level of quality
    Minimal frauds: skipping questions, not thinking through the answers
    Maximal frauds: A robot that randomly answers
  • 4. Duplicate Detection
    Fingerprint each row, e.g., with sum of numeric columns, multiplied by SD of same columns
    Create a new variable that contains this unique “checksum” value for each row/case
    Sort the dataset on the checksum
    Create a lagged difference variable that subtracts each row's checksum from that of its neighboring row
    Sort on the lagged difference and investigate all cases with zero or very small differences (a code sketch of this procedure appears below)
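The slide above describes the checksum procedure in spreadsheet/SPSS terms; the following is a minimal Python sketch of the same idea, offered only as an illustration. Pandas, the DataFrame `df`, and the helper name `flag_possible_duplicates` are assumptions, not part of the original module.

```python
import pandas as pd

def flag_possible_duplicates(df: pd.DataFrame, tol: float = 1e-9) -> pd.DataFrame:
    """Fingerprint every row and flag neighbors whose fingerprints (nearly) match."""
    num = df.select_dtypes("number")

    out = df.copy()
    # Checksum fingerprint: sum of the numeric columns times their standard deviation
    out["checksum"] = num.sum(axis=1) * num.std(axis=1)

    # Sort on the checksum, then take lagged differences between neighboring rows
    out = out.sort_values("checksum")
    out["lag_diff"] = out["checksum"].diff()

    # Zero or near-zero differences mark candidate duplicates to inspect by hand
    out["possible_dup"] = out["lag_diff"].abs() <= tol
    return out.sort_values("lag_diff")
```

Combining the row sum with the row standard deviation makes identical response vectors collide on the same fingerprint while keeping accidental collisions between different vectors unlikely, which is why near-zero lagged differences are worth tracing back to the raw records.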
  • 5. Bogus Response Detection
    Calculate common univariate statistics using the complete row of responses for each subject
    Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
    Sort the cases by the mean value
    Look for extreme outliers on the high and low ends
    Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
    Look for anomalies and trace them back to the original data for that subject
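A small Python sketch of this screening step follows; the function name and the assumption that responses sit in a numeric pandas DataFrame with one row per subject are illustrative only.

```python
import pandas as pd

def per_subject_summaries(df: pd.DataFrame) -> pd.DataFrame:
    """Univariate summaries of each subject's complete row of responses."""
    items = df.select_dtypes("number")
    return pd.DataFrame({
        "row_mean": items.mean(axis=1),
        "row_sd":   items.std(axis=1),
        "row_skew": items.skew(axis=1),
        "row_kurt": items.kurt(axis=1),
        "row_max":  items.max(axis=1),
        "row_min":  items.min(axis=1),
    })

# Screening example: sort by one summary at a time and inspect the extremes,
# e.g. per_subject_summaries(df).sort_values("row_mean")
```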
  • 6. Multivariate Outlier Detection
    Use Mahalanobis distance to detect outliers
    Regress a set of related items on an arbitrary dependent variable
    Sort by Mahalanobis distance: Larger distances are suggestive of outliers
    Use autocorrelation to detect unusual data patterns
    Flip the data: Cases become variables and variables become cases
    Run an autocorrelation function
    Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 at one or more lags)
    I have provided example SPSS code in the utilities area of the LMS for each of these tests
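The SPSS code mentioned above is not reproduced here; the Python sketch below implements the two checks in a direct form. Note that it computes Mahalanobis distance straight from the covariance matrix rather than through the regression trick described on the slide, and that the function names and the pandas/NumPy setup are assumptions. It assumes complete numeric item data.

```python
import numpy as np
import pandas as pd

def mahalanobis_distances(items: pd.DataFrame) -> pd.Series:
    """Distance of each subject's response vector from the centroid of all subjects."""
    X = items.to_numpy(dtype=float)
    diff = X - X.mean(axis=0)
    # Pseudo-inverse guards against a singular covariance matrix
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance for each row: diag(diff @ cov_inv @ diff.T)
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return pd.Series(np.sqrt(np.maximum(d2, 0)), index=items.index, name="mahalanobis")

def row_autocorrelations(items: pd.DataFrame, max_lag: int = 3) -> pd.DataFrame:
    """'Flip' the data: treat each subject's answers as a short series and compute
    its autocorrelation at the first few lags. Values above roughly .5 at any lag
    suggest an oddly regular pattern of responding."""
    out = {}
    for lag in range(1, max_lag + 1):
        out[f"lag{lag}"] = items.apply(lambda row, k=lag: row.autocorr(lag=k), axis=1)
    return pd.DataFrame(out)
```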
  • 7. Mahalanobis
  • 8. Plot, Sort, and Examine
  • 9. An ACF Indicating No Pattern
  • 10. An ACF with a Suspicious Pattern
  • 11. Common Missing Data Mitigation Techniques
    Item imputation
    For composite scales expressed as the average of a set of items, ignore missing values that appear on only a small subset of the items
    Mean substitution
    Suppresses variability
    Time series imputation
    Mean of neighboring points; suppresses spikes
    Regression imputation: works well for highly intercorrelated variables
    Full information maximum likelihood imputation
    Available in some SEM programs
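A brief Python sketch of the two simplest techniques on this list (person-mean scoring of a composite when only a few items are missing, and item-mean substitution). The function names, the `max_missing` threshold, and the pandas setup are illustrative assumptions; mean substitution is shown mainly to make its limitation concrete.

```python
import pandas as pd

def composite_score(items: pd.DataFrame, max_missing: int = 1) -> pd.Series:
    """Average of a set of scale items, ignoring missing values as long as a
    respondent is missing no more than `max_missing` of the items."""
    score = items.mean(axis=1, skipna=True)
    too_many_missing = items.isna().sum(axis=1) > max_missing
    return score.mask(too_many_missing)  # leave the composite missing for these rows

def mean_substitute(items: pd.DataFrame) -> pd.DataFrame:
    """Replace each missing value with its item's mean.
    Simple, but it suppresses variability in the imputed items."""
    return items.fillna(items.mean())
```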
  • 12. Excel Tips
    Your friend the “fill” function
    The power of “Paste Special”
    Sorting: Click on Data/Sort
  • 13. Excel Statistical Formulas
    =FIND(<find text>, <within text>, <start>)
    Looks for the string <find text> within the string <within text> and returns the position of the first occurrence, starting the search at character <start>
    Example: =FIND("=", "fish=head", 1)
    =LEN(<string>)
    Returns the number of characters in a string
    Example: =LEN("Ouch")
    =RIGHT(<string>, <length>)
    Returns the rightmost <length> characters of <string>
    Example: =RIGHT("fishhead", 4)
    =LEFT(<string>, <length>) works similarly
    =AVERAGE(value, value, …)
    Gives the arithmetic mean of a collection of cells and/or numeric values
    =STDEV(value, value, …) and =STDEVP(value, value, …)
    Give the sample and population standard deviation, respectively, of a collection of cells and/or numeric values
    =SUM(value, value, …)
    Gives the sum of a collection of cells and/or numeric values
    =CORREL(vector1, vector2)
    Gives the Pearson correlation between two vectors
    =IF(<test>, <value if true>, <value if false>)
    Performs a logical test and returns one of two values depending on whether the test is true or false
    Example: =IF(1=1, "Yes!", "No…")
  • 14. Summary of Bad Data Problems
    Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
    Unmotivated responding: The participant uses the same response option over and over again
    Malicious patterns: The participant enters an unusually regular pattern of responses
    There are at least five errors of these kinds in the exercise dataset (see below)