    Carma Internet Research Module: Detecting Bad Data (Presentation Transcript)

    • Detecting Bad Data
      CARMA Research Module
      Jeff Stanton
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-2)
      Sources of Data Problems in Online Studies
      Technical errors:
      Programming errors: Not common, but damaging when they occur
      Server errors: Can halt the collection of data
      Transmission errors: Uncommon and usually isolated to one record or field
      Response fraud:
      Inadvertent multiple response and malicious multiple response
      Missing data
      Intentionally malicious patterns of response leading to outliers or self-contradictory data
    • Response Fraud
      Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
      Participant incentives introduce mixed motives: an incentive to complete the instrument, but not to complete it to any particular level of quality
      Minimal frauds: skipping questions, not thinking through the answers
      Maximal frauds: A robot that randomly answers
      May 18-20, 2006
      Internet Data Collection Methods (Day 2-3)
    • Duplicate Detection
      Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
      Create a new variable that contains this “checksum” value for each row/case
      Sort the dataset on the checksum
      Create a lag difference variable that subtracts each row's checksum from its neighbor's
      Sort on the lag variable and investigate all cases with zero or near-zero differences (see the sketch after this slide)
      May 18-20, 2006
      Internet Data Collection Methods (Day 2-4)
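
      A minimal sketch of this checksum procedure in Python/pandas (the module itself references SPSS code); the DataFrame and item column names below are hypothetical and used only for illustration.

          import pandas as pd

          # Hypothetical example: `responses` holds one survey response per row,
          # with numeric item columns such as item1..itemN.
          def flag_possible_duplicates(responses: pd.DataFrame, item_cols) -> pd.DataFrame:
              items = responses[item_cols]
              # Fingerprint each row: sum of the items multiplied by their SD.
              checksum = items.sum(axis=1) * items.std(axis=1)
              out = responses.assign(checksum=checksum).sort_values("checksum")
              # Lag difference between neighboring checksums after sorting;
              # zero or near-zero differences mark rows worth inspecting.
              out["lag_diff"] = out["checksum"].diff().abs()
              return out.sort_values("lag_diff")

          # Example use (hypothetical column names):
          # suspects = flag_possible_duplicates(responses, ["item1", "item2", "item3"])
          # print(suspects[suspects["lag_diff"] < 1e-6])
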
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-5)
      Bogus Response Detection
      Calculate common univariate statistics using the complete row of responses for each subject
      Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
      Sort the cases by the mean value
      Look for extreme outliers on the high and low ends
      Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
      Look for anomalies and trace them back to the original data for that subject (see the sketch after this slide)
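
      As above, this is only an illustrative Python/pandas sketch of the per-respondent summary screen (hypothetical column names; the original module references SPSS).

          import pandas as pd

          # Hypothetical example: summarize each respondent's complete row of
          # numeric item responses, then sort and inspect the extremes.
          def respondent_summaries(responses: pd.DataFrame, item_cols) -> pd.DataFrame:
              items = responses[item_cols]
              return pd.DataFrame({
                  "mean": items.mean(axis=1),
                  "sd": items.std(axis=1),
                  "skew": items.skew(axis=1),
                  "kurt": items.kurt(axis=1),
                  "max": items.max(axis=1),
                  "min": items.min(axis=1),
              })

          # Example use: sort on each summary in turn and eyeball both tails.
          # summaries = respondent_summaries(responses, item_cols)
          # summaries.sort_values("mean").head(); summaries.sort_values("sd").tail()
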
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-6)
      Multivariate Outlier Detection
      Use Mahalanobis distance to detect outliers
      Regress a set of related items on an arbitrary dependent variable
      Sort by Mahalanobis distance: Larger distances are suggestive of outliers
      Use autocorrelation to detect unusual data patterns
      Flip the data: Cases become variables and variables become cases
      Run an autocorrelation function
      Look at the ACF graphs to find oddly regular patterns of responding (autocorrs in excess of .5 across one or more lags)
      I have provided example SPSS code in the utilities area of the LMS for each of these tests (an illustrative Python sketch also follows this slide)
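
      A minimal Python/NumPy sketch of both screens described above, assuming X is a cases-by-items numeric array of related responses; the Mahalanobis distances are computed directly here rather than via the regression trick on the slide, and this is an illustration only, not the SPSS code mentioned above.

          import numpy as np

          def mahalanobis_distances(X: np.ndarray) -> np.ndarray:
              # Squared Mahalanobis distance of each case from the centroid;
              # larger distances are suggestive of multivariate outliers.
              centered = X - X.mean(axis=0)
              inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))  # pinv guards against a singular covariance matrix
              return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

          def case_acf(row: np.ndarray, max_lag: int = 10) -> np.ndarray:
              # Autocorrelation of one respondent's answer sequence ("flipped" data:
              # the case is treated as a series); values above about .5 at any lag
              # suggest an oddly regular pattern of responding.
              r = row - row.mean()
              denom = np.dot(r, r)
              if denom == 0:  # constant responses: nothing to correlate
                  return np.zeros(max_lag)
              return np.array([np.dot(r[:-k], r[k:]) / denom for k in range(1, max_lag + 1)])

          # Example use (hypothetical data array X):
          # order = np.argsort(mahalanobis_distances(X))[::-1]   # inspect the largest distances first
          # print(case_acf(X[order[0]]))
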
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-7)
      Mahalanobis
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-8)
      Plot, Sort, and Examine
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-9)
      An ACF Indicating No Pattern
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-10)
      An ACF with a Suspicious Pattern
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-11)
      Common Missing Data Mitigation Techniques
      Item imputation
      For composite scales expressed as the average of a set of items, ignore missing values that appear on only a small subset of the items (see the sketch after this slide)
      Mean substitution
      Suppresses variability
      Time series imputation
      Mean of neighboring points; suppresses spikes
      Regression imputation
      Works well for highly intercorrelated variables
      Full information maximum likelihood imputation
      Available in some SEM programs
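
      An illustrative Python/pandas sketch of the two simplest techniques above (item-level imputation for composite scales and mean substitution); the column names and the max-missing threshold are hypothetical.

          import pandas as pd

          def scale_score_with_item_imputation(responses: pd.DataFrame, item_cols, max_missing: int = 1) -> pd.Series:
              # Average the available items for each case, but only when no more
              # than `max_missing` of the scale's items are missing.
              items = responses[item_cols]
              score = items.mean(axis=1, skipna=True)
              return score.where(items.isna().sum(axis=1) <= max_missing)

          def mean_substitution(responses: pd.DataFrame, item_cols) -> pd.DataFrame:
              # Replace each missing value with its item (column) mean;
              # as noted above, this suppresses variability.
              items = responses[item_cols]
              return items.fillna(items.mean())
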
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-12)
      Excel Tips
      Your friend the “fill” function
      The power of “Paste Special”
      Sorting: Click on Data/Sort
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-13)
      Excel Statistical Formulas
      =FIND(<find text>, <within text>, <start>)
      Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after <start>
      Example: =FIND("=", "fish=head", 1)
      =LEN(<string>)
      Returns the number of characters in a string
      Example: =LEN("Ouch")
      =RIGHT(<string>, <length>)
      Returns the rightmost <length> characters of a string
      Example: =RIGHT("fishhead", 4)
      =LEFT(<string>, <length>) works similarly
      =AVERAGE(value, value, …)
      Gives the arithmetic mean of a collection of cells and/or numeric values
      =STDEV(value, value, …) / =STDEVP(value, value, …)
      Gives the sample/population standard deviation of a collection of cells and/or numeric values
      =SUM(value, value, …)
      Gives the sum of a collection of cells and/or numeric values
      =CORREL(vector1, vector2)
      Gives the Pearson correlation between two vectors
      =IF(<test>, <value if true>, <value if false>)
      Makes a logical test and returns a different value depending on whether the test is true or false
      Example: =IF(1=1, "Yes!", "No…")
    • May 18-20, 2006
      Internet Data Collection Methods (Day 2-14)
      Summary of Bad Data Problems
      Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
      Unmotivated responding: Participant uses the same option over and over again
      Malicious patterns: Participant enters some unusually regular pattern of responses
      There are at least five errors of these kinds in the exercise dataset (see below)