1. Amazon Mechanical Turk: A Research Tool for Organizations and Information Systems Scholars
Kevin Crowston
Syracuse University / National Science Foundation
crowston@syr.edu / kcrowsto@nsf.gov
http://crowston.syr.edu/
This research and presentation have been supported by the National Science Foundation through Grant 09-68470. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
4. Potential benefits & limitations of AMT for research
Benefits:
Low cost to recruit subjects
Amazon handles payments, so Turkers are anonymous
Possible to recruit a diverse subject population
Can easily recruit multiple subjects at one time for collaborative experiments
Limitations:
Only basic features for selecting or filtering participants (see the sketch below)
No control over work setting or equipment
Hard to know how well a Turker understands the task
Limited opportunities for follow-up
Concerns about reliability and validity of data
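For illustration, a minimal sketch of recruiting through AMT programmatically, using the boto3 MTurk client (a current interface, not part of the original talk). The title, reward, URL, and thresholds are hypothetical; the qualification IDs are Amazon's built-in approval-rate and locale qualifications, which are roughly the extent of the filtering AMT offers.

```python
# Sketch: post a HIT with AMT's basic worker filters via boto3.
# Assumes configured AWS credentials; all task details are hypothetical.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# AMT's filtering is coarse: built-in qualifications such as approval
# rate and locale, or custom qualification tests you define yourself.
qualifications = [
    {   # built-in "percent assignments approved" qualification
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    },
    {   # built-in worker locale qualification
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
]

# An ExternalQuestion links the HIT to a task hosted on your own server
# (the notes below observe that tasks can run on Amazon's system or link
# to your own).
question_xml = """\
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.edu/survey</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Short research survey (hypothetical)",
    Description="Answer a five-minute questionnaire.",
    Reward="0.50",                     # US dollars, passed as a string
    MaxAssignments=50,                 # 50 distinct Turkers; AMT itself
                                       # blocks repeat work on one HIT
    AssignmentDurationInSeconds=1800,
    LifetimeInSeconds=7 * 24 * 3600,
    QualificationRequirements=qualifications,
    Question=question_xml,
)
print("HIT ID:", hit["HIT"]["HITId"])
```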
7. Reliability & validity concerns
Mode 1: Data about Turkers
Reliability (i.e., errors in responses): Use multiple indicators per construct
Internal validity (i.e., biased responses): Prevent or remove duplicate responses; consider effects of monetary compensation on research questions
Spam: Examine time taken to perform task; examine pattern of responses; include check questions
External validity (i.e., generalizability): Not perfectly representative of Internet users, but not worse than alternatives
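A sketch of how these mode-1 screening steps might look in practice, assuming a hypothetical results file with one row per response (all column names and thresholds are invented for illustration):

```python
# Sketch: screen mode-1 survey responses (columns/thresholds hypothetical).
import pandas as pd

df = pd.read_csv("responses.csv")  # WorkerId, seconds_taken, heart_attack_check, ...

# Prevent or remove duplicate responses: keep each Turker's first submission.
df = df.drop_duplicates(subset="WorkerId", keep="first")

# Examine time taken: flag implausibly fast completions as likely spam.
too_fast = df["seconds_taken"] < 60          # threshold is a judgment call

# Check question (from the notes below): "when watching TV, how often have
# you suffered a fatal heart attack?" -- anything but "never" fails.
failed_check = df["heart_attack_check"] != "never"

clean = df[~(too_fast | failed_check)]
print(f"Kept {len(clean)} of {len(df)} unique responses")
```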
9. Reliability & validity concerns
Reliability (i.e., errors in responses):
  Mode 1 (data about Turkers): Use multiple indicators per construct
  Mode 2 (data about research stimulus): Careful task design; prequalify Turkers; replicate work; use AMT to validate responses
Internal validity (i.e., biased responses):
  Mode 1: Prevent or remove duplicate responses; consider effects of monetary compensation on research questions
  Mode 2: Careful task design
Spam:
  Mode 1: Examine time taken to perform task; examine pattern of responses; include check questions
  Mode 2: Same as mode 1; include gold standard data; compare responses to detect outliers
External validity (i.e., generalizability):
  Mode 1: Not perfectly representative of Internet users, but not worse than alternatives
  Mode 2: N/A
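The mode-2 checks (gold standard data, replicated work, outlier detection) might be implemented roughly as follows; the file layout and 0.5 cutoffs are hypothetical:

```python
# Sketch: mode-2 quality checks, assuming each item is labeled by several
# Turkers and some items are seeded with known ("gold") answers.
import pandas as pd

df = pd.read_csv("labels.csv")    # columns (hypothetical): WorkerId, item_id, label
gold = pd.read_csv("gold.csv")    # columns: item_id, gold_label

# Gold standard data: each worker's accuracy on the seeded items.
merged = df.merge(gold, on="item_id")
accuracy = (merged["label"] == merged["gold_label"]).groupby(merged["WorkerId"]).mean()

# Replicated work: take the majority label per item as the consensus,
# then compare responses to the consensus to detect outlier workers.
consensus = df.groupby("item_id")["label"].agg(lambda s: s.mode().iloc[0])
df["agrees"] = df["label"] == df["item_id"].map(consensus)
agreement = df.groupby("WorkerId")["agrees"].mean()

outliers = agreement[(agreement < 0.5) | (accuracy.reindex(agreement.index) < 0.5)]
print("Workers to review:", sorted(outliers.index))
```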
11. Reliability & validity concerns
Reliability (i.e., errors in responses):
  Mode 1 (data about Turkers): Use multiple indicators per construct
  Mode 2 (data about research stimulus): Careful task design; prequalify Turkers; replicate work; use AMT to validate responses
  Mode 3 (data about interaction): Use multiple indicators per construct; prequalify Turkers
Internal validity (i.e., biased responses):
  Mode 1: Prevent or remove duplicate responses; consider effects of monetary compensation on research questions
  Mode 2: Careful task design
  Mode 3: Same as mode 1; design task to minimize demand; minimize time to reduce discussion of experiment
Spam:
  Mode 1: Examine time taken to perform task; examine pattern of responses; include check questions
  Mode 2: Same as mode 1; include gold standard data; compare responses to detect outliers
  Mode 3: Same as mode 1; include objective-answer questions that demonstrate task performance
External validity (i.e., generalizability):
  Mode 1: Not perfectly representative of Internet users, but not worse than alternatives
  Mode 2: N/A
  Mode 3: Same as mode 1
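"Use multiple indicators per construct" is typically assessed with an internal-consistency statistic such as Cronbach's alpha; a minimal sketch, with hypothetical item columns:

```python
# Sketch: Cronbach's alpha over the items measuring a single construct.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """items: one column per indicator, one row per respondent."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

df = pd.read_csv("responses.csv")
# "trust_1".."trust_3" are hypothetical indicators of one construct.
alpha = cronbach_alpha(df[["trust_1", "trust_2", "trust_3"]])
print(f"Cronbach's alpha = {alpha:.2f}")  # ~0.7+ is conventionally acceptable
```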
16. Potential benefits & limitations of AMT for research
Benefits:
Low cost to recruit subjects
Amazon handles payments, so Turkers are anonymous
Possible to recruit a diverse subject population
Can easily recruit multiple subjects at one time for collaborative experiments
Limitations:
Only basic features for selecting or filtering participants
No control over work setting or equipment
Hard to know how well a Turker understands the task
Limited opportunities for follow-up
Concerns about reliability and validity of data
17. Reliability & validity concerns
Mode 3: Data about interaction
Reliability (i.e., errors in responses): Use multiple indicators per construct; prequalify Turkers
Internal validity (i.e., biased responses): Prevent or remove duplicate responses; consider effects of monetary compensation on research questions; design task to minimize demand; minimize time to reduce discussion of experiment
Spam: Examine time taken to perform task; examine pattern of responses; include check questions; include objective-answer questions that demonstrate task performance
External validity (i.e., generalizability): Not perfectly representative of Internet users, but not worse than alternatives
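One way the "objective-answer questions" idea for mode 3 might be realized: embed questions whose answers the researcher can verify from the stimulus, then screen out Turkers who fail them (columns and expected answers below are hypothetical):

```python
# Sketch: verify objective-answer questions embedded in a mode-3 task,
# e.g., asking Turkers something checkable about the system they just used.
import pandas as pd

df = pd.read_csv("usability_responses.csv")   # hypothetical columns
expected = {                                  # hypothetical verifiable answers
    "results_shown_count": 12,
    "task_button_color": "green",
}

passed = pd.Series(True, index=df.index)
for column, answer in expected.items():
    passed &= df[column] == answer

print(f"{int(passed.sum())} of {len(df)} Turkers demonstrably performed the task")
```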
18. Conclusions
AMT can be a useful tool for research
Cheap, quick access to a useful pool of subjects or assistants
But need to be conscious of the issues in use
Issues depend on the kind of research you're doing
Many of the issues are similar to other kinds of research (e.g., reliability of measures)
Internal validity: the issue unique to AMT is spammers
External validity: not perfectly representative, but not unrepresentative
19. Acknowledgements
Nathan Prestopnik and Andrea Wiggins
Developers: Gongying Pu, Shu Zhang, Trupti Rane, Nathan Brown, Chris Duarte, Susan Furest, Yang Liu, Nitin Mule, Sheila Sicilia, Jessica Smith, Peiyuan Sun, Xueqing Xuan and Zhiruo Zhao
UMD: Anne Bowser, Jennifer Preece, Dana Rotman; Smithsonian: Jennifer Hammock; Discover Life: Nancy Lowe, John Pickering
NSF Grant 09-68470
Editor's Notes
AMT is a “marketplace for work that requires human intelligence”. It is an example of crowdsourcing, meaning “outsourcing a function to a large but undefined group of people via an open call”. AMT is the largest and best-characterized such marketplace. Tasks on AMT are typically small.
The unit of work is called a HIT. Example of a page of HITs: note that there are 2760 HITs available, many with multiple instances. Most are for small amounts of money: “25 percent of the HITs created on Mechanical Turk have a price tag of just US$0.01, 70 percent have a reward of $0.05 or less, and 90 percent pay less than $0.10”, with average pay of about US$4.80/hour for tasks. Pay too little and the task is slower to finish; pay too much and it seems bogus. Tasks can be done entirely on Amazon’s system or can link to your own system. Results returned to the poster include the Turker’s ID and the answers to the questions.
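To make the last point concrete, a sketch of pulling submitted results back from AMT with the boto3 client (again a current interface rather than anything shown in the deck; the HIT ID is hypothetical):

```python
# Sketch: retrieve submitted work for a HIT with boto3 (HIT ID hypothetical).
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")
response = mturk.list_assignments_for_hit(
    HITId="3EXAMPLEHITID",                # as returned by create_hit
    AssignmentStatuses=["Submitted"],
)

for assignment in response["Assignments"]:
    # Each assignment carries the Turker's ID plus an XML Answer payload
    # (QuestionFormAnswers) holding the responses to be parsed downstream.
    print(assignment["WorkerId"], assignment["AssignmentId"])
```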
We want to distinguish different uses of AMT, each with associated research concerns. The first case is collecting data about Turkers themselves, e.g., as a proxy for Internet users in a survey or experiment. Turkers should only do the HIT once.
Reliability: as in any survey. Internal validity: studies of AMT have reported that answers seem to be truthful, when given. A one-time survey doesn’t offer much opportunity for spam, but a survey can be done quickly, especially if you don’t read the questions before answering. Example check question: when watching TV, how often have you suffered a fatal heart attack? The failure rate is about the same as in other surveys (about 5%). External validity is a key concern. The demographics of Turkers differ somewhat from the general population or the population of Internet users: Turkers were younger than average Internet users; their self-reported education was higher than average, but income lower; most were single and without children. Furthermore, there are differences within the pool of Turkers, with resulting variability in other capabilities, e.g., the level of English ability. US Turkers were about 2/3 female, while Indian Turkers were about 70% male. Still, Turkers may not be appreciably less representative of the Internet or general population than other commonly-used subject pools, such as college students or subjects recruited on the Internet. Turkers are human subjects for the research, so the rules and ethical principles that govern human subjects research apply, e.g., informed consent.
The second possibility for AMT research is that the researcher is studying some collection of objects that need humans to provide data about them; a Turker could provide data on many objects. E.g., coding data: Karin Connely used AMT to check whether messages had a name or not, or whether a document had an answer to a question or not.
Reliability is the main issue. Tasks have to be carefully defined, since the only training is what’s on AMT. Spam can be a big problem: one estimate is that 30% of the responses to a posted task were provided by spammers; the spammers were a small fraction of the workers, but posted many bogus responses. The task should be designed “such that completing it accurately and in good faith requires as much or less effort than non-obvious random or malicious completion.” Ethical concerns regarding the use of human subjects in research do not apply in this mode. Instead, the Turkers can be seen as out-sourced employees, raising a different set of concerns about the fairness of such employment.
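A sketch of flagging such spammers, assuming replicated labels so that a consensus can be computed; the volume and agreement thresholds are judgment calls:

```python
# Sketch: flag likely spammers as prolific workers with low agreement
# against the per-item consensus (thresholds are judgment calls).
import pandas as pd

df = pd.read_csv("labels.csv")           # WorkerId, item_id, label (hypothetical)
consensus = df.groupby("item_id")["label"].agg(lambda s: s.mode().iloc[0])
df["agrees"] = df["label"] == df["item_id"].map(consensus)

per_worker = df.groupby("WorkerId").agg(n=("label", "size"),
                                        agreement=("agrees", "mean"))
spammers = per_worker[(per_worker["n"] > 50) & (per_worker["agreement"] < 0.6)]

# A handful of such workers can account for a large share of all responses,
# as in the ~30% figure above.
share = df["WorkerId"].isin(spammers.index).mean()
print(f"{len(spammers)} suspected spammers produced {share:.0%} of responses")
```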
In the third mode, data come from the interaction of people with a stimulus. For example, a common research use of AMT is to recruit users for tests of IT systems in order to get usage data and user feedback. Subjects’ responses to the stimulus are expected to differ, rather than simply reflecting an underlying truth inherent in the stimulus as in mode 2.
Validating subjective data for reliability is inherently difficult. Some of the techniques from the other modes may carry over. As in mode 1, it may be possible to use multiple items per construct to assess reliability. As in mode 2, careful task design and prequalification of Turkers will be useful. However, since many different answers could plausibly be correct [6], it is not possible to use “gold standard” data to spot-check results, or to use replication to arrive at a consensus. These limitations would seem to limit the usefulness of AMT for interpretivist research in particular. There is also a potentially higher risk of demand effects, e.g., Turkers being overly positive about a system because they think it will increase the odds of getting paid. For spam, one approach is to include a few questions that can be used to check that the work required for the task is actually being performed, even if the work itself cannot be checked.
Key questions: can novices actually classify moths? Will they do it?