Customer Linguistic Profiling
Predicting Personality Traits
on Facebook statuses
Vishweshwara Keekan
Dmitrij Petrov
Dustin Nguyen
Agenda
1. Goals
2. Stylometry and its use-cases
3. Predicting Big 5 Personality traits
1. Split the dataset
2. Train and test statistical models
3. Evaluate the performance & show final results
4. Summary
Goals
• Getting into the field of stylometry & natural language processing
• Conducting various data experiments on the Facebook dataset
Non-Goals
• Achieving better results than existing studies
Stylometry
• Emerged in the second half of the 19th century
• The term was coined by Wincenty Lutosławski in 1897
• Def.: “the statistical analysis of literary style”, dealing with “the study of individual or group characteristics in written language” (e.g. sentence length) (Holmes & Kardos 2003; Knight 1993)
• Applied for authorship attribution & profiling, plagiarism etc.
Examples
• Authorship Identification in Greek Tweets
• Modern Greek Twitter corpus of 12,973 tweets retrieved from 10 popular Greek users
• Character and word n-grams
• Forensic Stylometry for Anonymous Emails
• Frequent pattern technique
• Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
• Initially circulated as 80 hand-written chapters
• Cheng-Gao’s first printed edition added 40 further chapters
• The “chrono-divide” was recently confirmed via SVC-RFE with 10-50 features*
* Hu et al. (2014)
Supervised Machine Learning (S-ML)
• Dataset from MyPersonality.org project:
• 9,917 Facebook status updates from 250 users
• Statuses have, for our purposes, not been pre-processed (e.g. “OMG” or smileys like “☺” remained)
• Statuses are classified into binary Big Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. unsupervised ML)
• Dataset contains both input & (desired) output variables
• S-ML learns from labelled examples and, after several iterations, can classify unseen inputs
Methodology & Tools
• Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub*, etc.
• Methodology of S-ML:
• Prepare the data:
  • Extract relevant stylometric (NLP) features
  • Split the dataset into training & testing sets
• Train the model on the training set → learn by examples
• Test the model on the ‘unseen’ set → classify
• Validate the performance of the model → evaluate
*https://github.com/dmpe/CaseSolvingSeminar/
Extracted features from statuses
• 5 labels from ODS: cNEU, cAGR, cOPN, cCON, cEXT
• Feature from ODS: STATUS
• Extracted features:
  • Lexical (6): # functional words, lexical diversity [0-1], # words, # personal pronouns, parts-of-speech tags, bag-of-words (n-grams)
  • Character (8): string length, # dots, # commas, smileys, # semicolons, # colons, # *PROPNAME*, average word length
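To make these concrete, here is a minimal sketch of two of the character-level extractors in plain Python; the function names and the smiley pattern are illustrative assumptions, not the project's exact code:

import re

def count_smileys(status):
    # Naive smiley pattern; the real emoticon set is an assumption
    return len(re.findall(r"[:;]-?[)(DP]", status))

def average_word_length(status):
    words = re.findall(r"\w+", status)
    return sum(len(w) for w in words) / len(words) if words else 0.0

print(count_smileys("OMG best day ever :)"))        # 1
print(average_word_length("OMG best day ever :)"))  # 3.5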
Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
• Use a stratified split (train_test_split with stratify) to create the training and testing sets; stratified k-fold CV is applied later during evaluation
>>> train_X, test_X, train_Y, test_Y = sk.cross_validation.train_test_split(
...     agr[:, 1:9], agr["cAGR"],
...     train_size=0.66, stratify=agr["cAGR"], random_state=5152)
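For the 10-fold evaluation, the same stratification idea carries over to cross-validation. A sketch against the pre-0.18 sklearn.cross_validation API used above; the indexing mirrors the slide's agr usage:

from sklearn import cross_validation

X, y = agr[:, 1:9], agr["cAGR"]
# Each of the 10 folds preserves the cAGR class proportions
skf = cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True,
                                       random_state=5152)
for train_idx, test_idx in skf:
    train_X, test_X = X[train_idx], X[test_idx]
    train_Y, test_Y = y[train_idx], y[test_idx]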
Classification Metrics -> Confusion Matrix (1)

                      “Golden Standard” (real truth values)
                      Positive                            Negative
Predicted positive    True Positive (TP)                  False Positive (FP, Type 1 error)   -> Precision
Predicted negative    False Negative (FN, Type 2 error)   True Negative (TN)
                      -> Recall/Sensitivity               (Specificity)

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * (precision * recall) / (precision + recall)
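A quick sanity check of these formulas with scikit-learn's metrics module (toy labels, purely illustrative):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [True, True, False, True, False, False]
y_pred = [True, False, False, True, True, False]

# Rows/columns ordered [False, True], so ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 2 1 1 2
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all = 4/6
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # 2/3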
Learning and predicting
• Head-on approach: classifiers only

from sklearn.naive_bayes import MultinomialNB

# Assumption: features are numeric values only
classifier = MultinomialNB()
classifier.fit(train_X, train_Y).predict(test_X)  # yields a prediction for test_X

• But: STATUS is a string, not numeric → vectorize it first

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

nb_pipeline = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('nb', MultinomialNB())
])
predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)

• Validation of results

import numpy as np
from sklearn import cross_validation  # pre-0.18 API
from sklearn.metrics import precision_score, recall_score, f1_score

scores = cross_validation.cross_val_score(
    nb_pipeline,
    np.concatenate((train_X, test_X)),  # cross-validate on the full dataset
    np.concatenate((train_Y, test_Y)),
    cv=10, scoring='accuracy'
)
accuracy, std_deviation = scores.mean(), scores.std() * 2
precision = precision_score(test_Y, predicted)
recall = recall_score(test_Y, predicted, labels=[False, True])
f1 = f1_score(test_Y, predicted, labels=[False, True])
Pipeline: Source Code Example

import sklearn.pipeline
import sklearn.preprocessing
import sklearn.naive_bayes
import sklearn.feature_extraction.text

pipeline = sklearn.pipeline.Pipeline([
    ('features', sklearn.pipeline.FeatureUnion(
        transformer_list=[
            ('status_string', sklearn.pipeline.Pipeline([  # tf-idf on the status text
                ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
            ])),
            ('derived_numeric', sklearn.pipeline.Pipeline([  # aggregator creates derived numeric values
                ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                ('scaler', sklearn.preprocessing.MinMaxScaler()),
            ])),
        ],
    )),
    ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB())
])
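Aggregator, LexicalDiversity and NumberOfFunctionalWords are project-specific classes (see the GitHub repo linked earlier); a hypothetical sketch of how such a transformer can plug into FeatureUnion:

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LexicalDiversity(object):
    # Ratio of unique words to all words, in [0-1]
    def __call__(self, status):
        words = re.findall(r"\w+", status.lower())
        return len(set(words)) / len(words) if words else 0.0

class Aggregator(BaseEstimator, TransformerMixin):
    # Applies each feature callable to every status -> numeric feature matrix
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[f(status) for f in self.features] for status in X])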
Parameter fine-tuning
• Most transformers and classifiers accept different parameters
• Parameters can heavily influence the result

from sklearn.grid_search import GridSearchCV  # pre-0.18 API

grid_params = {
    'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3))
}
grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)

# print best parameters
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(grid_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
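After fitting, the best cross-validated score and parameter set are also available directly on the GridSearchCV object:

print(grid_search.best_score_)   # mean cross-validated accuracy of the best setting
print(grid_search.best_params_)  # e.g. {'features__status_string__tf_idf_vect__ngram_range': (1, 2)}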
Baseline 1: STATUS (TF-IDF) column only

Trait   Accuracy mean   Accuracy std. dev.   Recall   Precision   F1-score   Best algorithm
NEU     0.426           +/- 0.03             0.81     0.63        0.51       k-NN
OPN     0.747           +/- 0.03             0.998    0.87        0.854      Bernoulli-NB
AGR     0.585           +/- 0.03             0.91     0.76        0.698      Bernoulli-NB
EXT     0.600           +/- 0.03             0.45     0.60        0.48       Linear-SVC
CON     0.514           +/- 0.04             0.90     0.70        0.61       k-NN

(All results with 10-fold CV.)
Baseline 2: Derived columns

Trait   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU     + 0.196         +/- 0.03             - 0.794   - 0.238     - 0.480    Bernoulli-NB
OPN     - 0.004         +/- 0.03             + 0.002   + 0.001     - 0.001    Bernoulli-NB
AGR     - 0.054         +/- 0.03             - 0.773   - 0.231     - 0.433    SVC
EXT     - 0.015         +/- 0.03             - 0.273   - 0.071     - 0.215    Bernoulli-NB
CON     + 0.025         +/- 0.04             - 0.5     - 0.114     - 0.167    Bernoulli-NB

(10-fold CV; signed values are changes relative to Baseline 1.)
Pipeline 3: Mix of STATUS and non-STATUS columns

Trait   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU     + 0.064         +/- 0.25             + 0.170   + 0.050     + 0.030    Linear SVC
OPN     - 0.017         +/- 0.03             - 0.002   +/- 0       - 0.004    Multinomial-NB
AGR     - 0.082         +/- 0.07             + 0.082   - 0.020     - 0.008    Linear SVC
EXT     - 0.073         +/- 0.02             - 0.070   - 0.060     - 0.070    k-NN
CON     + 0.001         +/- 0.08             + 0.096   + 0.030     + 0.020    Linear SVC

(10-fold CV; signed values are changes relative to Baseline 1.)
Results/Summary
• Hardly any improvement from the head-on approach, at least not over the baseline
• Strongly limited by hardware (CPU):
  • grid_search effort grows as Count(algorithms) * Count(parameters) * Count(labels) → rapidly growing effort
  • grid_search for 1 label, 3 parameters and LinearSVC took > 20 minutes
  • Future research: look at GPU acceleration (NVIDIA)
• Inconsistent data (multiple languages, e.g. Spanish)