Customer Linguistic Profiling
Predicting Personality Traits
on Facebook statuses
Vishweshwara Keekan
Dmitrij Petrov
Dustin Nguyen
Agenda
1. Goals
2. Stylometry and its use-cases
3. Predicting Big 5 Personality traits
1. Split the dataset
2. Train and test statistical models
3. Evaluate the performance & show final results
4. Summary
Goals
• Getting into the field of stylometry & natural language processing
• Conducting various data experiments on the Facebook dataset
Non-Goals
• Achieving better results than existing studies
Stylometry
• Emerged in the second half of the 19th century
• The term was coined by Wincenty Lutosławski in 1897
• Def.: “the statistical analysis of literary style”, dealing with “the study of individual or group characteristics in written language” (e.g. sentence length) (Holmes & Kardos 2003; Knight 1993)
• Applied for authorship attribution & profiling, plagiarism etc.
Examples
• Authorship Identification in Greek Tweets
• Modern Greek Twitter corpus of 12,973 tweets retrieved from 10 popular Greek users
• Character and word n-grams
• Forensic Stylometry for Anonymous Emails
• Frequent pattern technique
• Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
• Initially circulated as 80 hand-written chapters
• Cheng-Gao’s first printed edition added 40 further chapters
• The “chrono-divide” was recently confirmed via SVC-RFE with 10-50 features*
* Hu et al. (2014)
Supervised Machine Learning (S-ML)
• Dataset from MyPersonality.org project:
• 9,917 Facebook status updates from 250 users
• Statuses have, for our purposes, not been pre-processed (e.g. “OMG” or smileys like “☺” remained)
• Statuses are classified into binary Big Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. unsupervised ML)
• Dataset contains both input & (desired) output variables
• S-ML learns from labelled examples and, after several iterations, can classify unseen inputs
Methodology & Tools
• Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub*, etc.
• Methodology of S-ML:
• Prepare the data:
  • Extract relevant stylometric (NLP) features
  • Split the dataset into training & testing sets
• Train the model on the training set → learn by examples
• Test the model on the ‘unseen’ set → classify
• Validate the performance of the model → evaluate
*https://github.com/dmpe/CaseSolvingSeminar/
Extracted features from statuses
• 5 labels from ODS: cNEU, cAGR, cOPN, cCON, cEXT
• Feature from ODS: STATUS
• Extracted features:
  • Lexical (6): # functional words, lexical diversity [0-1], # words, # personal pronouns, parts-of-speech tags, bag-of-words (n-grams)
  • Character (8): string length, # dots, # commas, smileys, # semicolons, # colons, # *PROPNAME*, average word length
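To make these concrete, here is a minimal sketch of two of the character-level extractors in plain Python; the function names and the smiley pattern are illustrative assumptions, not the project's exact code:

import re

def count_smileys(status):
    # Naive smiley pattern; the real emoticon set is an assumption
    return len(re.findall(r"[:;]-?[)(DP]", status))

def average_word_length(status):
    words = re.findall(r"\w+", status)
    return sum(len(w) for w in words) / len(words) if words else 0.0

print(count_smileys("OMG best day ever :)"))        # 1
print(average_word_length("OMG best day ever :)"))  # 3.5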
Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
• Use a stratified split (train_test_split with stratify) to create the training and testing sets; stratified k-fold CV is applied later during evaluation
>>> train_X, test_X, train_Y, test_Y = sk.cross_validation.train_test_split(
...     agr[:, 1:9], agr["cAGR"],
...     train_size=0.66, stratify=agr["cAGR"], random_state=5152)
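For the 10-fold evaluation, the same stratification idea carries over to cross-validation. A sketch against the pre-0.18 sklearn.cross_validation API used above; the indexing mirrors the slide's agr usage:

from sklearn import cross_validation

X, y = agr[:, 1:9], agr["cAGR"]
# Each of the 10 folds preserves the cAGR class proportions
skf = cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True,
                                       random_state=5152)
for train_idx, test_idx in skf:
    train_X, test_X = X[train_idx], X[test_idx]
    train_Y, test_Y = y[train_idx], y[test_idx]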
Classification Metrics -> Confusion Matrix (1)

                      “Golden Standard” (real truth values)
                      Positive                            Negative
Predicted positive    True Positive (TP)                  False Positive (FP, Type 1 error)   -> Precision
Predicted negative    False Negative (FN, Type 2 error)   True Negative (TN)
                      -> Recall/Sensitivity               (Specificity)

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * (precision * recall) / (precision + recall)
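A quick sanity check of these formulas with scikit-learn's metrics module (toy labels, purely illustrative):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [True, True, False, True, False, False]
y_pred = [True, False, False, True, True, False]

# Rows/columns ordered [False, True], so ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 2 1 1 2
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all = 4/6
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # 2/3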
Learning and predicting
• Head-on approach: classifiers only

from sklearn.naive_bayes import MultinomialNB

# Assumption: features are numeric values only
classifier = MultinomialNB()
classifier.fit(train_X, train_Y).predict(test_X)  # yields a prediction for test_X

• But: STATUS is a string, not numeric → vectorize it first

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

nb_pipeline = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('nb', MultinomialNB())
])
predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)

• Validation of results

import numpy as np
from sklearn import cross_validation  # pre-0.18 API
from sklearn.metrics import precision_score, recall_score, f1_score

scores = cross_validation.cross_val_score(
    nb_pipeline,
    np.concatenate((train_X, test_X)),  # cross-validate on the full dataset
    np.concatenate((train_Y, test_Y)),
    cv=10, scoring='accuracy'
)
accuracy, std_deviation = scores.mean(), scores.std() * 2
precision = precision_score(test_Y, predicted)
recall = recall_score(test_Y, predicted, labels=[False, True])
f1 = f1_score(test_Y, predicted, labels=[False, True])
Pipeline: Source Code Example

import sklearn.pipeline
import sklearn.preprocessing
import sklearn.naive_bayes
import sklearn.feature_extraction.text

pipeline = sklearn.pipeline.Pipeline([
    ('features', sklearn.pipeline.FeatureUnion(
        transformer_list=[
            ('status_string', sklearn.pipeline.Pipeline([  # tf-idf on the status text
                ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
            ])),
            ('derived_numeric', sklearn.pipeline.Pipeline([  # aggregator creates derived numeric values
                ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                ('scaler', sklearn.preprocessing.MinMaxScaler()),
            ])),
        ],
    )),
    ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB())
])
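Aggregator, LexicalDiversity and NumberOfFunctionalWords are project-specific classes (see the GitHub repo linked earlier); a hypothetical sketch of how such a transformer can plug into FeatureUnion:

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LexicalDiversity(object):
    # Ratio of unique words to all words, in [0-1]
    def __call__(self, status):
        words = re.findall(r"\w+", status.lower())
        return len(set(words)) / len(words) if words else 0.0

class Aggregator(BaseEstimator, TransformerMixin):
    # Applies each feature callable to every status -> numeric feature matrix
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[f(status) for f in self.features] for status in X])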
Parameter fine-tuning
• Most transformers and classifiers accept different parameters
• Parameters can heavily influence the result

from sklearn.grid_search import GridSearchCV  # pre-0.18 API

grid_params = {
    'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3))
}
grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)

# print best parameters
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(grid_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
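After fitting, the best cross-validated score and parameter set are also available directly on the GridSearchCV object:

print(grid_search.best_score_)   # mean cross-validated accuracy of the best setting
print(grid_search.best_params_)  # e.g. {'features__status_string__tf_idf_vect__ngram_range': (1, 2)}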
Baseline 1: STATUS (TF-IDF) column only

Trait   Accuracy mean   Accuracy std. dev.   Recall   Precision   F1-score   Best algorithm
NEU     0.426           +/- 0.03             0.81     0.63        0.51       k-NN
OPN     0.747           +/- 0.03             0.998    0.87        0.854      Bernoulli-NB
AGR     0.585           +/- 0.03             0.91     0.76        0.698      Bernoulli-NB
EXT     0.600           +/- 0.03             0.45     0.60        0.48       Linear-SVC
CON     0.514           +/- 0.04             0.90     0.70        0.61       k-NN

(All results with 10-fold CV.)
Baseline 2: Derived columns

Trait   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU     + 0.196         +/- 0.03             - 0.794   - 0.238     - 0.480    Bernoulli-NB
OPN     - 0.004         +/- 0.03             + 0.002   + 0.001     - 0.001    Bernoulli-NB
AGR     - 0.054         +/- 0.03             - 0.773   - 0.231     - 0.433    SVC
EXT     - 0.015         +/- 0.03             - 0.273   - 0.071     - 0.215    Bernoulli-NB
CON     + 0.025         +/- 0.04             - 0.5     - 0.114     - 0.167    Bernoulli-NB

(10-fold CV; signed values are changes relative to Baseline 1.)
Pipeline 3: Mix of STATUS and non-STATUS columns

Trait   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU     + 0.064         +/- 0.25             + 0.170   + 0.050     + 0.030    Linear SVC
OPN     - 0.017         +/- 0.03             - 0.002   +/- 0       - 0.004    Multinomial-NB
AGR     - 0.082         +/- 0.07             + 0.082   - 0.020     - 0.008    Linear SVC
EXT     - 0.073         +/- 0.02             - 0.070   - 0.060     - 0.070    k-NN
CON     + 0.001         +/- 0.08             + 0.096   + 0.030     + 0.020    Linear SVC

(10-fold CV; signed values are changes relative to Baseline 1.)
Results/Summary
• Hardly any improvement from the head-on approach, at least not over the baseline
• Strongly limited by hardware (CPU):
  • grid_search effort grows as Count(algorithms) * Count(parameters) * Count(labels) → rapidly growing effort
  • grid_search for 1 label, 3 parameters and LinearSVC took > 20 minutes
  • Future research: look at GPU acceleration (NVIDIA)
• Inconsistent data (multiple languages, e.g. Spanish)