BACKGROUND
The Lab:
The Language and Natural Reasoning Group (part of the Center for the Study
of Language and Information) studies the inferential properties of linguistic
expression to enable automated reasoning for natural language understanding.
The lab’s experiments use an Amazon service called Mechanical Turk, which
allows registered users within the US to participate in a study from the comfort of
their own computers. It also automates important tasks such as paying participants
and consolidating experimental data. Furthermore, it easily provides a larger and
more diverse sample than could be recruited on a college campus. The downside is
that participants cannot be physically monitored, and their answers to
demographic questions (such as native or first language) cannot be easily verified.
The Experiment:
I primarily worked on an experiment named E16, which built on previous
experiments to help determine subjects’ interpretations of certain adjectives under
specific circumstances. The most important sentences presented to a subject were
of the form:
“It was [not] [adjective] of [noun phrase] to [verb phrase]”
These target sentences’ most important variables are:
1. The adjective
2. The polarity (whether the modifier “not” is present)
3. Whether the sentence’s adjective is consonant, dissonant, or neither/neutral to
the verb phrase (an attribute abbreviated CND)
Example:
“Stephanie was not brave to fight dragons.”
This sentence is classified as negative consonant because the adjective is
modified by “not” and the verb phrase “to fight dragons” is consonant with being brave.
The subject is then asked whether they believe, based on the sentence, that the
actor in the sentence did in fact commit the action described by the verb phrase or
not (e.g. “Stephanie fought dragons” vs. “Stephanie did not fight dragons”). They
may also choose “cannot decide.”
How the subject answers these questions determines whether they tend toward
a factive or implicative understanding of certain adjectives.
Factivity & Implicativity of Adjectives:
Designing and Coding a Linguistic Experiment for Amazon Mechanical Turk
EXPERIMENTAL SETUP
The lab stores the information used to create stimuli in CSV (comma-separated
value) documents. This information is used to create experimental trials via the
following procedure:
1. Convert CSV to JSON
Because the experiment relies heavily on JavaScript, the data must first be
converted to JSON (JavaScript Object Notation). Each CSV file has its own Python
script to do this, though each script follows the same procedure:
1. Read the CSV’s header to determine the fields it defines
2. For each row of data, create a dictionary mapping header-defined keys to values
3. Build a second dictionary whose fields are relevant only for creating experimental
trials later on
4. Use values from the first dictionary (raw or processed) to populate the second
dictionary
5. Append this dictionary to a multidimensional data structure whose nature
depends on how trials will be created later on
6. Write the contents of this data structure into a JSON file containing a single JSON
object
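The lab’s converters are per-file Python scripts, but since the rest of the pipeline is JavaScript, the same six steps can be sketched in JavaScript for consistency. This is a minimal sketch, not the lab’s actual code; the field names (`adjective`, `polarity`, `sentence`) are hypothetical:

```javascript
// Sketch of the six-step CSV-to-JSON conversion described above.
function csvToTrialJson(csvText) {
  const lines = csvText.trim().split('\n');
  // Step 1: read the header to determine the fields it defines
  const header = lines[0].split(',');

  const trialData = [];  // Step 5's data structure (here, a flat list)
  for (const line of lines.slice(1)) {
    // Step 2: map header-defined keys to this row's values
    const raw = {};
    line.split(',').forEach((value, i) => { raw[header[i]] = value; });

    // Steps 3-4: build a second dictionary holding only trial-relevant
    // fields, populated from raw or processed values
    const trial = {
      adjective: raw.adjective,
      negated: raw.polarity === 'negative',  // processed value
      sentence: raw.sentence,
    };

    trialData.push(trial);  // Step 5
  }
  // Step 6: serialize everything into a single JSON object
  return JSON.stringify({ trials: trialData }, null, 2);
}
```

Note that the naive `split(',')` would break on quoted fields containing commas; a real converter would use a proper CSV parser.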
This is an improvement over the procedure for previous experiments, in which
the contents of the CSV file were translated to a JSON file manually. That
required writing out many levels of brackets, quoted strings, etc. by hand,
which made the files difficult to edit.
The new procedure uses Python’s JSON library to make writing the new file
nearly trivial. This is technically less efficient, but is more suitable for this lab’s
relatively small datasets.
2. Build the Trials
A script named `create-trials.js` is run each time a subject begins the
experiment. It receives the JSON objects described above and selects trials to be
used within the experiment.
This selection is mostly random, though there are certain constraints that are
necessary to make the experiment’s results as useful as possible. For E16,
constraints on target stimuli include:
• Selecting only 6 adjectives out of the 23 available
• An equal distribution of consonant, neutral, and dissonant stimuli
• A positive-to-negative ratio of 1:3 for each adjective
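A minimal sketch of how such constrained random selection might look (the actual logic lives in `create-trials.js`; the stimulus structure here is an assumption, and the equal-CND constraint, which would be enforced by filtering on a `cnd` field in the same way, is omitted for brevity):

```javascript
// Fisher-Yates shuffle of a copy of the input array.
function shuffle(arr) {
  const a = arr.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// stimuli: [{adjective, polarity: 'positive'|'negative', cnd: 'C'|'N'|'D'}, ...]
function selectTargets(stimuli, allAdjectives) {
  // Constraint: only 6 of the available adjectives
  const chosen = shuffle(allAdjectives).slice(0, 6);

  const targets = [];
  for (const adj of chosen) {
    const pool = stimuli.filter(s => s.adjective === adj);
    // Constraint: a 1:3 positive-to-negative ratio per adjective
    targets.push(...shuffle(pool.filter(s => s.polarity === 'positive')).slice(0, 1));
    targets.push(...shuffle(pool.filter(s => s.polarity === 'negative')).slice(0, 3));
  }
  return shuffle(targets);
}
```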
Aside from the Target Stimuli, there are also Filler (Distractor) Stimuli and
Gold Standard Stimuli. The Fillers are mainly used to add variety to sentence
structure/content among the trials. Subjects’ answers to Fillers may be useful for
some analyses but they are not the main concern of the experiment. Gold
Standards primarily function to assess the subjects’ general comprehension of the
questions. Unlike Targets, Gold Standards have correct/incorrect answers, and an
incorrect answer indicates that the subject should perhaps be excluded from the
final analyses.
Fillers and Gold Standards are randomly selected without constraints. However,
the distribution of Fillers and Gold Standards between Target sentences is
algorithmically pseudo-random.
Finally, some of the trials are assigned followup questions. These provide
some extra information, and depend on the subject’s answer to the trial’s primary
question. A followup question may ask whether the subject would phrase an
idea/expression the same way as the stimulus did, or why they answered the
primary question with “cannot decide.”
Only a certain number of Target and Filler trials are assigned followup
questions, and they are chosen randomly. This is done to collect as much useful
information as possible without making the experiment too long for subjects to
stay focused.
RUNNING THE EXPERIMENT
The experiment is run in an HTML iframe element on the MTurk website. This
means that the experiment is essentially its own independent website embedded
within Amazon Mechanical Turk.
The majority of the website’s functionality comes from a script named
`experiment.js`, which serves a number of purposes. This script is derived from
one that had been used in previous experiments, and is designed to be
maintainable, extensible, and brief in order to make the design and creation of
future experiments more streamlined. Here are the main features:
Statefulness
The website uses only one HTML page with many div elements, each of the
class `slide`. A global variable named `currentSlide` dictates which slide is
visible and active. The potential values of `currentSlide` are provided by an
object called `Slides` that imitates an enumerated data type (JavaScript does not
provide enums natively):
var Slides = {
    INSTRUCTIONS: "Instructions",
    EXAMPLES: "Examples",
    …
}
Thus the experiment can be said to always be in a particular slide state (e.g.
currentSlide = Slides.INSTRUCTIONS). This provides several important benefits:
• Moving between slide states is accomplished through a single function,
`changeSlides`, which can perform side effects (such as recording
timestamps) depending on the target state.
• States can easily be added and removed to fit the needs of future
experiments.
• States can be compared for equality, e.g. if (currentSlide == Slides.START)
…
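The pattern can be sketched as follows. The DOM manipulation is omitted, and the particular side effect shown (recording when each state is entered) is illustrative rather than the lab’s actual bookkeeping:

```javascript
// Enum-like Slides object and a single transition function.
var Slides = {
  INSTRUCTIONS: "Instructions",
  EXAMPLES: "Examples",
  STAGE: "Stage",
};

var currentSlide = Slides.INSTRUCTIONS;
var slideEnteredAt = {};  // illustrative side-effect target

function changeSlides(target) {
  // Hide the div for the old slide and show the div for the new one
  // (DOM code omitted in this sketch).
  currentSlide = target;
  // Example side effect, performed when entering the target state:
  slideEnteredAt[target] = Date.now();
}
```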
Furthermore, individual states can themselves be stateful. For example, the
Stage slide, where the subject views and answers trial questions, has a state
defined by the type of question currently being asked, such as
StageStates.PRIMARY and StageStates.OTHER_FOLLOWUP.
Trial Data Storage
Each time a response is submitted, the data associated with it is stored in an
object named `currentResponse` with a fixed set of fields. These fields may be null
(valueless) but are never excluded, so that trial data fields are consistent.
One field of interest is `followup`, which is initially set to null. If the trial includes
a followup question, the response data for that question is recorded here as
another object (also with fixed, nullable fields).
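The shape of such a response object might look like this. The `followup` field is from the description above; the other field names are assumptions, not the lab’s actual schema:

```javascript
// Every response carries the same fixed set of fields; unused fields
// stay null rather than being omitted, so trial data stays consistent.
function newResponse(trialId) {
  return {
    trialId: trialId,
    answer: null,     // e.g. "did", "did not", "cannot decide"
    followup: null,   // replaced by another fixed-field object if a followup is asked
    eventTimes: [],
  };
}

var currentResponse = newResponse(1);
// If the trial carries a followup question, its response is nested:
currentResponse.followup = { question: "rephrase", answer: null };
```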
Trial Event Time Data
Another field of interest is `eventTimes`, an array that holds information about
how long the Subject took to initially choose an answer, change an answer, submit
an answer, etc. Again, the range of information contained here can very easily be
edited for future experiments. This is accomplished by a single short function
whose only argument is the name of the event to be recorded with a timestamp.
Note that this timestamp in particular does not record the UTC time, but the
duration since the trial began, thus acting like a stopwatch.
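That stopwatch-style recorder can be sketched in a few lines (the function and event names here are illustrative):

```javascript
// One short function whose only argument is the event name; it stores
// the elapsed time since the trial began, not the wall-clock time.
var trialStart = Date.now();
var eventTimes = [];

function recordEvent(name) {
  eventTimes.push({ event: name, ms: Date.now() - trialStart });
}

// Usage: recordEvent("chooseAnswer"); ... recordEvent("submitAnswer");
```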
This data can be used to determine information about individual questions and
subjects: How often does a subject change their mind before submitting an
answer? Which questions take the longest to answer? Are subjects getting fatigued
by the end of the study? Of course, this data is not entirely robust, but it is useful
since MTurk does not provide the luxury of physically monitoring subjects.
PERSONAL RESPONSIBILITIES & GOALS
My job within the lab dealt much more with designing and implementing
E16 than with analyzing the results. The software used by the lab for
previous experiments was fully functional, but not very maintainable or easy
to change, especially for those unfamiliar with the code.
Thus my primary goal was to write the scripts for E16 based on
procedures used in the previous experiments, while ensuring that the E16
scripts could easily be re-used for any subsequent experiment with only a
few minor changes. Additionally, I kept the new scripts as well-organized and
documented as possible so that anyone who had not encountered the
scripts before could edit them confidently and quickly.
Aaron King Advisors: Lauri Karttunen, Stanley Peters, and Annie Zaenen
