Distributional semantic word representations allow Natural Language Processing systems to extract and model an immense amount of information about a language. This technique maps words into a high-dimensional continuous space through the use of a single-layer neural network, and it has enabled advances in many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests: questions of the form ``If a is to a', then b is to what?'' are answered by composing multiple word vectors and searching the vector space. During the neural network training process, each word is examined as a member of its context. Generally, a word's context is taken to be the elements adjacent to it within a sentence. While some work has examined the effect of expanding this definition, very little exploration has been done in this area. Further, no inquiry has been conducted into the specific linguistic competencies of these models, or into whether modifying their contexts affects the information they extract. In this paper we propose a thorough analysis of the various lexical and grammatical competencies of distributional semantic models. We aim to leverage analogy tests to evaluate the most advanced distributional model across 14 different types of linguistic relationships. With this information we will then be able to investigate whether modifying the training context produces differences in quality across any of these categories. Ideally, we will identify training approaches that increase precision in specific linguistic categories, which will allow us to investigate whether these improvements can be combined, by joining the information used in the different training approaches, into a single, improved model.
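As a concrete illustration of how an analogy test queries the vector space, the following sketch answers ``a is to a' as b is to ?'' by vector offset and cosine similarity. The four-word embedding table here is a toy invented for illustration; real evaluations use vectors learned by a trained model.

```python
from math import sqrt

# Toy embedding table (invented values); a trained model supplies real vectors.
emb = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [0.2, 0.0, 1.0],
    "queen": [0.2, 1.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def analogy(a, a2, b, emb):
    """Answer "a is to a2 as b is to ?" via the offset v(a2) - v(a) + v(b),
    returning the nearest remaining word by cosine similarity."""
    target = [ya - xa + xb for xa, ya, xb in zip(emb[a], emb[a2], emb[b])]
    candidates = {w: v for w, v in emb.items() if w not in (a, a2, b)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
```

Note that the query words themselves are excluded from the search, as is standard in analogy evaluation, since the offset vector often lies closest to one of the inputs.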
3. Big Five Personality Inventory
(Norman, 1963; Goldberg, 1981)
1. Openness to experience
2. Extraversion
3. Conscientiousness
4. Emotional stability
(vs. Neuroticism)
5. Agreeableness
(Diagram: language use as a signal of the five traits: Agreeable, Stable, Conscientious, Open, Extraverted)
4. Previous Works
1. Pennebaker and King (1999)
   a. Self-report Essays dataset with 2,468 instances
2. Automatic personality prediction based on text (Pennebaker et al., 2001)
   a. Extracted linguistic features using the Linguistic Inquiry and Word Count (LIWC) text analysis tool
3. Mohammad and Kiritchenko (2013) introduced new linguistic features
4. Tighe et al. (2016) applied feature-reduction techniques such as Principal Component Analysis (PCA) and Information Gain (IG)
5. Motivation
● Applications in daily-life domains
○ Dating websites
○ Anti-terrorism
● Character mining
○ Attribute extraction
6. Objectives
● Create a new Friends dataset for the task
● Present a novel approach to automatic personality prediction using attention-based neural networks with word embeddings
● Evaluate our models on both datasets
8. Big Five Theories (John et al., 1991)
Big Five Traits: Facets
Extraversion vs. introversion: sociable, forceful, energetic, adventurous, enthusiastic, outgoing
Agreeableness vs. antagonism: forgiving, not demanding, warm, not stubborn, not show-off, sympathetic
Conscientiousness vs. lack of direction: efficient, organized, not careless, thorough, not lazy, not impulsive
Neuroticism vs. emotional stability: tense, irritable, not contented, shy, moody, not self-confident
Openness vs. closedness to experience: curious, imaginative, artistic, wide interests, excitable, unconventional
9. Linguistic Inquiry and Word Count (LIWC)
Categories: Examples
Past tense: walked, were, had
Negations: no, never, not
Swear words: *****
Friends: pal, buddy, coworker
Positive emotions: happy, pretty, good
Anger: hate, kill, pissed
Assent: agree, OK, yes
Nonfluencies: uh, rr*
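LIWC-style features are relative frequencies of category words in a text. The sketch below uses a toy lexicon built from the examples above (the real LIWC dictionary is proprietary and far larger), and the helper name `liwc_features` is ours, not part of the LIWC tool.

```python
# Toy lexicon from the slide's examples; the real LIWC dictionary is proprietary.
LEXICON = {
    "negations": {"no", "never", "not"},
    "assent": {"agree", "ok", "yes"},
    "positive_emotions": {"happy", "pretty", "good"},
}

def liwc_features(text, lexicon=LEXICON):
    """Relative frequency of each category's words in the text."""
    tokens = text.lower().split()
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        cat: sum(t.strip(".,!?") in words for t in tokens) / n
        for cat, words in lexicon.items()
    }
```

Each feature is normalized by text length, so essays of different sizes become comparable.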
18. Friends Dataset
- Not domain-specific
- Simple language

Dataset: Essays | Friends | EAR
Source: written | observation | spoken
Structure: monologue | dialogue | dialogue
Report type: self-report | observation | self-report & observation
Number of words: 1.9 million | 556,273 | 97,468
Instances: 2,468 | 3,488 | 96
Words per instance: 651 | 161 | 1,015
19. Sub-scene Extraction Process
1. Use a window technique to find a main speaker's frequency distribution in each scene
2. Choose peaks in each frequency distribution
3. Use the peaks to find the index range of each sub-scene
4. Extract multiple sub-scenes from each scene, thus increasing our data size
5. Optimize window_size and min_conversation_length
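The steps above can be sketched as follows. The peak threshold (the target speaker occupying more than half the window) and the default parameter values are our own assumptions for illustration, not the authors' tuned settings.

```python
def speaker_frequency(scene, speaker, window_size):
    """Sliding-window count of the target speaker's lines in a scene,
    where a scene is a list of (speaker, utterance) pairs."""
    return [
        sum(1 for spk, _ in scene[i:i + window_size] if spk == speaker)
        for i in range(len(scene) - window_size + 1)
    ]

def find_peaks(freqs, threshold):
    """Indices that are local maxima above the threshold."""
    peaks = []
    for i, f in enumerate(freqs):
        if f <= threshold:
            continue
        left = freqs[i - 1] if i > 0 else -1
        right = freqs[i + 1] if i < len(freqs) - 1 else -1
        if f >= left and f > right:
            peaks.append(i)
    return peaks

def extract_subscenes(scene, speaker, window_size=4, min_conversation_length=3):
    """Extract the window around each frequency peak as a sub-scene,
    keeping only sub-scenes of sufficient length."""
    freqs = speaker_frequency(scene, speaker, window_size)
    peaks = find_peaks(freqs, threshold=window_size // 2)
    return [
        scene[p:p + window_size]
        for p in peaks
        if len(scene[p:p + window_size]) >= min_conversation_length
    ]
```

Tuning `window_size` and `min_conversation_length` trades off sub-scene count against how much of each sub-scene the target speaker actually dominates.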
20. Annotation through Crowdsourcing (online annotation)
- Extracted 8,738 sub-scenes from the 10-season Friends transcript
- Had 3,448 sub-scenes from the first 4 seasons annotated through the Amazon Mechanical Turk platform
23. Final Annotation
Steps:
1. Keep the initial annotations unchanged [-1, 1]
2. Add the three annotators' scores, producing 7 classes [-3, 3]
3. Classes -3 and 3 are too small
4. Merge -3 with -2, and 3 with 2, producing 5 classes
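The merging step amounts to summing the three scores and clipping the extremes; a minimal sketch (the function name is ours):

```python
def final_class(annotations):
    """Sum three annotator scores (each in {-1, 0, 1}) into [-3, 3],
    then merge -3 with -2 and 3 with 2, leaving 5 classes in [-2, 2]."""
    total = sum(annotations)
    return max(-2, min(2, total))
```

Clipping rather than relabeling keeps the class ordering intact, which matters if the labels are later treated as an ordinal scale.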
24. Three formats of Friends dataset
Original Conversation
  Ross: Hi, Rachel.
  Rachel: Hi Ross.
  Ross: I have a bad day.
  Rachel: Oh.
  Ross: How is your day?

Single
  Ross: Hi, Rachel.
  Ross: I have a bad day.
  Ross: How is your day?

Single+Context
  Ross: Hi, Rachel.
  Ross: I have a bad day.
  Ross: How is your day?
  Rachel: Hi Ross.
  Rachel: Oh.

Target
  #Targ# Ross: Hi, Rachel.
  #NonTarg# Rachel: Hi Ross.
  #Targ# Ross: I have a bad day.
  #NonTarg# Rachel: Oh.
  #Targ# Ross: How is your day?
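The three formats can be derived mechanically from the original conversation, modeled as a list of (speaker, utterance) pairs. A minimal sketch (the function names are ours):

```python
def to_single(conv, target):
    """Keep only the target speaker's lines."""
    return [f"{s}: {u}" for s, u in conv if s == target]

def to_single_context(conv, target):
    """Target speaker's lines first, then the other speakers' lines."""
    tgt = [f"{s}: {u}" for s, u in conv if s == target]
    ctx = [f"{s}: {u}" for s, u in conv if s != target]
    return tgt + ctx

def to_target(conv, target):
    """Keep the original order, marking each line #Targ# or #NonTarg#."""
    return [
        ("#Targ# " if s == target else "#NonTarg# ") + f"{s}: {u}"
        for s, u in conv
    ]
```

Only the Target format preserves turn order, which is the information an attention mechanism over the dialogue could exploit.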
31. 6. Conclusion
● A new Friends dataset is created, and it shows the challenges of annotating dialogue text data
● A novel approach to automatic personality prediction is presented
● A new benchmark is achieved on the Essays dataset
● All models fail to work on the Friends dataset, indicating that the annotations lack consistency
32. Future Works
● LIWC-integrated CNN/LSTM with attention mechanism
● A platform to support the human annotation process by providing multimodal information
● Use of the Big Five Inventory questionnaire