BACKGROUND
The Lab:
The Language and Natural Reasoning Group (part of the Center for the Study
of Language and Information) studies the inferential properties of linguistic
expressions to enable automated reasoning for natural language understanding.
The lab’s experiments use an Amazon service called Mechanical Turk, which
allows registered users within the US to participate in the study from the comfort
of their own computers. It also automates important tasks such as paying participants
and consolidating experimental data. Furthermore, it easily provides a larger and
more diverse sample than could be achieved at a college campus. The downside is
that participants cannot be physically monitored, and their answers to
demographic questions (such as native or first language) cannot be easily verified.
The Experiment:
I primarily worked on an experiment named E16, which built on previous
experiments to help determine subjects’ interpretations of certain adjectives under
specific circumstances. The most important sentences presented to a subject were
of the form:
“It was [not] [adjective] of [noun phrase] to [verb phrase]”
These target sentences’ most important variables are:
1. The adjective
2. The polarity (whether the modifier “not” is present)
3. Whether the sentence’s adjective is consonant, dissonant, or neither/neutral to
the verb phrase (an attribute abbreviated CND)
Example:
“Stephanie was not brave to fight dragons.”
This sentence is classified as negative consonant because its adjective is modified
by “not” and the verb phrase “to fight dragons” is consonant with being brave.
The subject is then asked whether they believe, based on the sentence, that the
actor in the sentence did in fact commit the action described by the verb phrase or
not (e.g. “Stephanie fought dragons” vs. “Stephanie did not fight dragons”). They
may also choose “cannot decide.”
How the subject answers these questions determines whether they tend toward
a factive or implicative understanding of certain adjectives.
Factivity & Implicativity of Adjectives: Designing and Coding a Linguistic Experiment for Amazon Mechanical Turk
EXPERIMENTAL SETUP
The lab stores the information used to create stimuli in CSV (comma-separated
value) documents. This information is used to create experimental trials via the
following procedure:
1. Convert CSV to JSON
Because the experiment relies heavily on JavaScript, the data must first be
converted to JSON (JavaScript Object Notation). Each CSV file has its own Python
script to do this, though each script follows the same procedure:
1. Read the CSV’s header to determine the fields it defines
2. For each row of data, create a dictionary mapping header-defined keys to values
3. Build a second dictionary whose fields are relevant only for creating experimental
trials later on
4. Use values from the first dictionary (raw or processed) to populate the second
dictionary
5. Append this dictionary to a multidimensional data structure whose nature
depends on how trials will be created later on
6. Write the contents of this data structure into a JSON file containing a single JSON
object
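The six steps above can be sketched in a few lines. The lab’s actual converters are Python scripts, one per CSV file; the version below is an illustrative JavaScript equivalent, and the field names (`adjective`, `polarity`, `cnd`) are hypothetical.

```javascript
// Sketch of the CSV-to-JSON procedure. The lab's real converters are
// Python scripts; field names here are hypothetical.
function csvToTrialObjects(csvText) {
  const lines = csvText.trim().split("\n");
  // Step 1: read the header to determine the fields it defines.
  const header = lines[0].split(",");
  const trials = [];
  for (const line of lines.slice(1)) {
    // Step 2: map header-defined keys to this row's values.
    const raw = {};
    line.split(",").forEach((value, i) => { raw[header[i]] = value; });
    // Steps 3-4: build a second dictionary holding only the fields
    // relevant to creating trials, populated from the raw values.
    const trial = {
      adjective: raw.adjective,
      polarity: raw.polarity === "negative" ? "negative" : "positive",
      cnd: raw.cnd, // consonant / neutral / dissonant
    };
    // Step 5: append to a data structure for later trial creation.
    trials.push(trial);
  }
  return trials;
}

// Step 6: serialize the structure as a single JSON object.
const csv = "adjective,polarity,cnd\nbrave,negative,consonant";
const json = JSON.stringify({ trials: csvToTrialObjects(csv) });
```

In Python the final step is analogous: `json.dump` writes the whole structure in one call, which is what makes the new procedure nearly trivial compared to hand-written JSON.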
This is an improvement over the procedure for previous experiments, in which
the contents of the CSV file were translated into a JSON file by hand. That
approach required manually writing out many levels of brackets, quoted strings,
etc., which made the files difficult to edit.
The new procedure uses Python’s JSON library to make writing the new file
nearly trivial. This is technically less efficient, but is more suitable for this lab’s
relatively small datasets.
2. Build the Trials
A script named `create-trials.js` is run each time a subject begins the
experiment. It receives the JSON objects described above and selects trials to be
used within the experiment.
This selection is mostly random, though there are certain constraints that are
necessary to make the experiment’s results as useful as possible. For E16,
constraints on target stimuli include:
• Selecting only 6 adjectives out of the 23 available
• An equal distribution of consonant, neutral, and dissonant stimuli
• A positive-to-negative ratio of 1:3 for each adjective
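A constrained selection like this might be sketched as follows. This is not the actual logic of `create-trials.js`; it is a minimal illustration of the three constraints above, assuming a Fisher-Yates shuffle as the source of randomness.

```javascript
// Fisher-Yates shuffle, used here for unbiased random selection.
function shuffle(array) {
  const a = array.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Illustrative target selection (not the real create-trials.js logic).
function selectTargets(adjectives) {
  // Constraint 1: use only 6 of the available adjectives.
  const chosen = shuffle(adjectives).slice(0, 6);
  const targets = [];
  for (const adjective of chosen) {
    // Constraint 2: equal numbers of consonant, neutral, dissonant stimuli.
    for (const cnd of ["consonant", "neutral", "dissonant"]) {
      // Constraint 3: a 1:3 positive-to-negative ratio per adjective.
      const polarities = shuffle(["positive", "negative", "negative", "negative"]);
      for (const polarity of polarities) {
        targets.push({ adjective, cnd, polarity });
      }
    }
  }
  return targets;
}
```

With these (assumed) counts, each subject would see 6 adjectives × 3 CND values × 4 polarities = 72 target trials; the real experiment may use different totals.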
Aside from the Target Stimuli, there are also Filler (Distractor) Stimuli and
Gold Standard Stimuli. The Fillers are mainly used to add variety to sentence
structure/content among the trials. Subjects’ answers to Fillers may be useful for
some analyses but they are not the main concern of the experiment. Gold
Standards primarily function to assess the subjects’ general comprehension of the
questions. Unlike Targets, Gold Standards have correct/incorrect answers, and an
incorrect answer indicates that the subject should perhaps be excluded from the
final analyses.
Fillers and Gold Standards are randomly selected without constraints. However,
the distribution of Fillers and Gold Standards between Target sentences is
algorithmically pseudo-random.
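One way such a pseudo-random distribution could work, purely for illustration, is to insert a random number of Fillers and Gold Standards after each Target so that Targets never appear back-to-back. The actual algorithm in `create-trials.js` may differ.

```javascript
// Illustrative only: distribute fillers pseudo-randomly between targets
// so that no two targets are adjacent (assumes enough fillers).
function interleave(targets, fillers) {
  const sequence = [];
  const pool = fillers.slice();
  for (const target of targets) {
    sequence.push(target);
    // Insert 1-2 fillers after each target, while any remain.
    const n = 1 + Math.floor(Math.random() * 2);
    for (let i = 0; i < n && pool.length > 0; i++) {
      sequence.push(pool.pop());
    }
  }
  // Any leftover fillers go at the end.
  return sequence.concat(pool);
}
```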
Finally, some of the trials are assigned followup questions. These provide
some extra information, and depend on the subject’s answer to the trial’s primary
question. A followup question may ask whether the user would phrase an idea/
expression the same way as was presented by the stimulus, or why they answered
the primary question with “cannot decide.”
Only a certain number of Target and Filler trials are assigned followup
questions, and they are chosen randomly. This is done to collect as much useful
information as possible without making the experiment too long for subjects to
stay focused.
RUNNING THE EXPERIMENT
The experiment is run in an HTML iframe element on the MTurk website. This
essentially means that the experiment runs as its own independent website
embedded within Amazon Mechanical Turk.
The majority of the website’s functionality comes from a script named
`experiment.js`, which serves a number of purposes. This script is derived from
one used in previous experiments, and is designed to be maintainable,
extensible, and brief in order to make the design and creation of
future experiments more streamlined. Here are the main features:
Statefulness
The website uses only one HTML page with many div elements, each of the
class slide. There is a global variable named `currentSlide` which dictates which
slide is visible and active. The potential values of `currentSlide` are provided by an
object called Slides that imitates an enumerable data type (JavaScript does not
provide an enumerable type natively):
var Slides = {
  INSTRUCTIONS: "Instructions",
  EXAMPLES: "Examples",
  …
}
Thus the experiment can be said to always be in a particular slide state (e.g.
currentSlide = Slides.INSTRUCTIONS). This provides several important benefits:
• Moving between slide states is accomplished through a single function,
`changeSlides`, which can accomplish side-effects (such as recording
timestamps) depending on the target state.
• States can easily be added and removed to fit the needs of future
experiments.
• The states can be evaluated for equality, e.g. if currentSlide == Slides.START
…
Furthermore, individual states can themselves be stateful. For example, the
Stage slide, where the subject views and answers trial questions, has a state
defined by the type of question currently being asked, such as
StageStates.PRIMARY and StageStates.OTHER_FOLLOWUP.
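Putting these pieces together, the state machine might look like the sketch below. The `Slides` object and `changeSlides` function are from the description above; the specific side effect (recording a timestamp on entering the Stage slide) is a hypothetical example.

```javascript
// Enum-like object of slide states (JavaScript has no native enum type).
const Slides = {
  INSTRUCTIONS: "Instructions",
  EXAMPLES: "Examples",
  STAGE: "Stage",
};

let currentSlide = Slides.INSTRUCTIONS;
const timestamps = {};

// All slide transitions go through this single function, which can
// perform side effects depending on the target state.
function changeSlides(target) {
  if (target === Slides.STAGE) {
    // Hypothetical side effect: record when the trial stage begins.
    timestamps.stageEntered = Date.now();
  }
  currentSlide = target;
  // In the real script, this would also toggle visibility of the
  // corresponding div.slide element.
}
```

Because states are plain string values, adding or removing one for a future experiment is a one-line change, and equality checks like `currentSlide == Slides.STAGE` work as expected.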
Trial Data Storage
Each time a response is submitted, the data associated with it is stored in an
object named `currentResponse` with a fixed set of fields. These fields may be null
(valueless) but are never omitted, so that the structure of trial data is consistent
across trials.
One field of interest is `followup`, which is initially set to null. If the trial includes
a followup question, the response data for that question is recorded here as
another object (also with fixed, nullable fields).
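A minimal sketch of such a fixed-field record is shown below. Apart from `followup` and `eventTimes`, which the text describes, the field names are hypothetical.

```javascript
// Sketch of a response record with a fixed, nullable set of fields.
// Field names other than `followup` and `eventTimes` are hypothetical.
function newResponse(trialId) {
  return {
    trialId: trialId,
    answer: null,     // e.g. "did", "did not", or "cannot decide"
    followup: null,   // replaced by another fixed-field object if the
                      // trial includes a followup question
    eventTimes: [],   // timing events recorded during the trial
  };
}
```

Keeping every field present (even when null) means downstream analysis code never has to guard against missing keys.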
Trial Event Time Data
Another field of interest is `eventTimes`, an array that holds information about
how long the Subject took to initially choose an answer, change an answer, submit
an answer, etc. Again, the range of information contained here can very easily be
edited for future experiments. This is accomplished by a single short function
whose only argument is the name of the event to be recorded with a timestamp.
Note that this timestamp in particular does not record the UTC time, but the
duration since the trial began, thus acting like a stopwatch.
This data can be used to determine information about individual questions and
subjects: How often does a subject change their mind before submitting an
answer? Which questions take the longest to answer? Are subjects getting fatigued
by the end of the study? Of course, this data is not entirely robust, but it is useful
since MTurk does not provide the luxury of physically monitoring subjects.
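The stopwatch-style recording described above can be sketched as a single short function. The function and variable names here are illustrative, not necessarily those used in `experiment.js`.

```javascript
// Sketch of stopwatch-style event recording: each event stores the
// duration since the trial began rather than an absolute UTC time.
let trialStart = Date.now();
const eventTimes = [];

// Single short function whose only argument is the event name.
function recordEvent(name) {
  eventTimes.push({ event: name, elapsedMs: Date.now() - trialStart });
}
```

Resetting `trialStart` at the beginning of each trial makes the recorded durations directly comparable across trials, and adding a new event type for a future experiment is just another `recordEvent("…")` call.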
PERSONAL RESPONSIBILITIES & GOALS
My job within the lab dealt much more with designing and implementing
E16, as opposed to analyzing the results. The software used by the lab for
previous experiments was fully functional, but not very maintainable or easy
to change, especially for those unfamiliar with the code.
Thus my primary goal was to write the scripts for E16 based on
procedures used in the previous experiments, while ensuring that the E16
scripts could easily be re-used for any subsequent experiment with only a
few minor changes. Additionally, I kept the new scripts as well-organized and
documented as possible so that anyone who had not encountered the
scripts before could edit them confidently and quickly.
Aaron King. Advisors: Lauri Karttunen, Stanley Peters, and Annie Zaenen