Semantic Analysis to Compute
Personality Traits from Social
Media Posts
Master Degree in Computer Engineering
DAUIN
Supervisor
Prof. Maurizio Morisio
Internship Tutor
Dott. Ing. Giuseppe Rizzo
Candidate
Giulio Carducci s225395
Personality
Is it possible to automatically compute the personality of an
individual from the language he/she uses in social networks?
2
Background – Lexical Hypothesis
Lexical Hypothesis
• Personality characteristics that are important to
a group of people will eventually become a part
of that group's language.
• Main personality characteristics of an individual
are more likely to be encoded into language as a
single word.
Sir Francis Galton
Galton, F. Measurement of Character. Fortnightly Review, 1884, 36:179-185.
3
Five Factor Model (FFM)
• Openness
inventive/curious vs. consistent/cautious
• Conscientiousness
efficient/organized vs. easy-going/careless
• Extraversion
outgoing/energetic vs. solitary/reserved
• Agreeableness
friendly/compassionate vs. challenging/detached
• Neuroticism
sensitive/nervous vs. secure/confident
Background – Personality
4
Social networks are rich sources of
information
Personality prediction from social
network data
• Page likes
• Number of followers/following
• Choice of profile picture
• Personal profile information
• ...
Background – Personality and Social Networks
5
myPersonality
• Up to 95% prediction accuracy
• Average accuracy of 77%
Background – Personality and Social Networks
6
Word Embedding denotes a set of NLP techniques where words are mapped
to vectors of real numbers.
‘cat’ → $[x_1, x_2, x_3, \ldots, x_{n-1}, x_n]$, $n = 300$
Word embeddings can improve the performance of many NLP applications,
and have two main advantages over traditional word vectorization
techniques:
• Dimensionality reduction
Vector space of dimension 𝑛 instead of the number of distinct words
• Contextual similarity
Similar words are mapped to vectors that are close in the vector space
Background – Semantic Analysis
7
Similar words:
• Masculine/Feminine
• Verb forms
• States/Capitals
• Synonyms
• Similar concepts
• ...
Geometrical properties
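$\mathrm{vector}(\text{King}) - \mathrm{vector}(\text{Man}) + \mathrm{vector}(\text{Woman}) \approx \mathrm{vector}(\text{Queen})$
$\mathrm{vector}(\text{Italy}) + \mathrm{vector}(\text{Capital}) \approx \mathrm{vector}(\text{Rome})$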
Background – Semantic Analysis
8
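The vector arithmetic on the slide above can be tried out on a toy example. This is a minimal sketch, assuming hand-made 3-dimensional vectors instead of real 300-dimensional embeddings; the vocabulary, the axis meanings and the `analogy` helper are all hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical toy embeddings (real models use n = 300).
# Axes are hand-picked for illustration: [royalty, gender, fruit].
toy_vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    excluded = set(positive) | set(negative)
    candidates = {w: v for w, v in vectors.items() if w not in excluded}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


# vector(King) - vector(Man) + vector(Woman)
result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
```

With these toy vectors the closest remaining word is "queen"; with trained embeddings the match is only approximate, which is why the slide uses ≈.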
Experimental Setup – Overview
9
• Big: 16,000,000 status updates of 115,000 users
• Small: 10,000 status updates of 250 users

| Statistic | Value | MIN | AVG | MAX |
|---|---|---|---|---|
| Status updates per user | | 1 | 39 | 223 |
| Total words | 146,128 | | | |
| Total words after preprocessing | 72,896 | | | |
| Distinct words | 15,470 | | | |
| Distinct words after preprocessing | 15,185 | | | |
| Words per status update | | 1 | 14 | 113 |
| Words per status update after preprocessing | | 0 | 7 | 57 |
Experimental Setup – Gold Standard
MyPersonality Dataset
10
• 1 million word vectors
• 𝑛 = 300
• Trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news
dataset
• Trained with the continuous-bag-of-words (cbow) model from word2vec
• Ordered by descending frequency
• 95.08% word coverage on myPersonality
Experimental Setup – Word Embeddings
11
[Figure: groups of word embeddings labeled Geography, Stopwords, First names, Numbers, Verbs]
Experimental Setup – Word Embeddings
12
• Conversion to lowercase
“Today is a #sunny day!” → “today is a #sunny day!”
• Stop-words removal
“today is a #sunny day!” → “today #sunny day!”
• Punctuation removal
“today #sunny day!” → “today sunny day”
• Tokenization
“today sunny day” → [today] [sunny] [day]
• Short posts removal
All posts with fewer than 3 tokens are removed.
Removes noise and less-informative data
Experimental Setup – Text Preprocessing
13
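The preprocessing steps on the slide above can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny hypothetical one, and the function name `preprocess` and the constant `MIN_TOKENS` are assumptions, not the code used in the thesis.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"is", "a", "the", "of", "and", "to", "in"}
MIN_TOKENS = 3  # posts with fewer tokens are discarded


def preprocess(post):
    """Return the token list for a post, or None if the post is too short."""
    # 1. Conversion to lowercase.
    text = post.lower()
    # 2. Stop-word removal (punctuation is ignored when comparing).
    words = [w for w in text.split() if w.strip(string.punctuation) not in STOP_WORDS]
    # 3. Punctuation removal.
    text = " ".join(words).translate(str.maketrans("", "", string.punctuation))
    # 4. Tokenization on whitespace.
    tokens = text.split()
    # 5. Short posts removal.
    return tokens if len(tokens) >= MIN_TOKENS else None


tokens = preprocess("Today is a #sunny day!")
```

Returning `None` for short posts keeps the filtering decision next to the tokenization, so a caller can simply skip those posts.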
Experimental Setup – Text Transformation
14
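• Feed training data to the algorithm to compute a predictive model
• Training samples: (vec(status update), Big5 score)
• Supervised learning: for each training sample, we specify the ground-truth label
Linear Regression
$\mathbf{y} = \mathbf{X}\beta + \epsilon$
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{900} x_{i900} + \epsilon_i, \quad i = 1, 2, \ldots, N$
Least Absolute Shrinkage and Selection Operator (LASSO)
$\min_{\beta} \frac{1}{2 n_{\text{samples}}} \lVert \mathbf{X}\beta - \mathbf{y} \rVert_2^2 + \alpha \lVert \beta \rVert_1$
Support Vector Machines (SVM)
$\mathbf{y} = \mathbf{X}\beta + \epsilon$
$J(\beta) = \frac{1}{2} \lVert \beta \rVert^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
$\min_{\beta} J(\beta)$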
Experimental Setup – Model Training
15
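The three algorithms named on the slide above all exist in scikit-learn, which the technical setup lists. The sketch below fits each of them on synthetic data; the data, the hyperparameter values and the variable names are assumptions chosen so that the example runs quickly, not the thesis configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 200 status updates,
# each a 900-dimensional feature vector, with one trait score per sample.
X = rng.normal(size=(200, 900))
y = np.clip(3.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200), 1.0, 5.0)

# Illustrative hyperparameters only; gamma is left at scikit-learn's default.
models = {
    "linear_regression": LinearRegression(),
    "lasso": Lasso(alpha=0.01, max_iter=5000),
    "svm": SVR(kernel="rbf", C=10),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                           # supervised: features plus labels
    predictions[name] = model.predict(X[:5])  # scores for the first 5 samples
```

In the setup described by the slides, one such model would be trained per personality trait, giving five models in total.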
• Also called hyperparameter tuning
• Loss function
Mean Squared Error (MSE): mean squared difference between actual and
predicted values, averaged over 10-fold cross-validation
$\mathrm{MSE}(P, A) = \frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2$
$P = (p_1, p_2, \ldots, p_n)$
$A = (a_1, a_2, \ldots, a_n)$
$\mathrm{MSE} \in \mathbb{R}^{+}: [0, +\infty)$

| Algorithm | Parameter | Value |
|---|---|---|
| SVM | Kernel | linear, rbf, poly |
| SVM | C | 1, 10, 100 |
| SVM | Gamma | 0.01, 0.1, 1, 10 |
| SVM | Degree | 2, 3 |
| LASSO | Alpha | 1e−15, 1e−10, 1e−8, 1e−5, 1e−4, 1e−3, 1e−2, 1, 5, 10 |

Selected algorithm: SVM
Experimental Setup – Parameter Optimization
16
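A grid search with 10-fold cross-validation and MSE as the loss can be sketched with scikit-learn's `GridSearchCV`. The data below is synthetic and the grid is a reduced subset of the values in the table, both assumptions made so the example runs in a few seconds; the helper `mse` is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR


def mse(predicted, actual):
    """Mean squared difference between predicted and actual values."""
    p, a = np.asarray(predicted, dtype=float), np.asarray(actual, dtype=float)
    return float(np.mean((p - a) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))               # synthetic features (assumption)
y = 3.0 + X[:, 0] + rng.normal(scale=0.1, size=100)

# Reduced grid; the full table also lists the poly kernel and more values.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(
    SVR(),
    param_grid,
    scoring="neg_mean_squared_error",        # scikit-learn maximizes scores
    cv=10,                                   # 10-fold cross-validation
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_               # mean MSE over the 10 folds
```

The sign flip on `best_score_` is needed because scikit-learn reports the negated MSE so that larger is always better.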
• Further cleaning steps applied before preprocessing:
∙ Pure retweets removal (= retweets with no added comment)
∙ URLs removal
∙ Mentions removal
• Preprocessing and transformation performed the same way as status updates
[Figure: tweets pass through Clean → Preprocess → Transform; each tweet becomes a feature vector $[x_1, x_2, x_3, \ldots, x_{899}, x_{900}]$, and the predicted scores lie in the range [0 − 5]]
Experimental Setup – Personality Prediction
17
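The extra cleaning steps for tweets, and the per-user averaging mentioned in the speaker notes, could look roughly like this. It is a sketch under assumptions: the regular expressions are simple approximations, a pure retweet is assumed to be text of the form "RT @user: ...", and the function names are hypothetical.

```python
import re
from statistics import mean

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumption: a pure retweet starts with "RT @user:" and has no added comment.
PURE_RETWEET_RE = re.compile(r"^RT @\w+:")


def clean_tweets(tweets):
    """Drop pure retweets, then strip URLs and mentions from the rest."""
    cleaned = []
    for text in tweets:
        if PURE_RETWEET_RE.match(text):
            continue
        text = URL_RE.sub("", text)
        text = MENTION_RE.sub("", text)
        text = " ".join(text.split())        # collapse leftover whitespace
        if text:
            cleaned.append(text)
    return cleaned


def user_score(tweet_scores):
    """Average the per-tweet predictions into one score for the user."""
    return mean(tweet_scores)


sample = ["RT @bob: hello", "Look at this https://t.co/abc @alice nice", "@carol"]
cleaned = clean_tweets(sample)
score = user_score([3.0, 4.0, 5.0])
```

After cleaning, each remaining tweet would go through the same preprocessing and transformation as a status update before prediction.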
SVM configuration and MSE per trait:

| Trait | Kernel | C | Gamma | MSE |
|---|---|---|---|---|
| Openness | rbf | 1 | 1 | 0.3316 |
| Conscientiousness | rbf | 10 | 1 | 0.5300 |
| Extraversion | rbf | 10 | 1 | 0.7084 |
| Agreeableness | rbf | 10 | 1 | 0.4477 |
| Neuroticism | rbf | 10 | 10 | 0.5572 |

• Margin over linear regression: 8%
• Margin over LASSO: 1%

| Method | MSE mean | MSE std |
|---|---|---|
| Sum | 0.6942 | 0.4862 |
| Maximum | 0.5350 | 0.0228 |
| Minimum | 0.5342 | 0.0230 |
| Average | 0.5366 | 0.0246 |
| Concatenation | 0.5364 | 0.0188 |

• Low mean MSE
• Lowest MSE std
Concatenation is more stable than the other methods
Experimental Results – Algorithm and Transformation
18
MyPersonality big: 16,000,000 status updates of 116,000 users
Same approach as for myPersonality small
Training samples: (vec(status update), Big5 score)
Issues: training time, overfitting
Downsample to:
• 5,000
• 10,000
• 15,000
• 20,000

Mean Squared Error:

| Dataset | OPE | CON | EXT | AGR | NEU |
|---|---|---|---|---|---|
| MP small | 0.3316 | 0.5300 | 0.7084 | 0.4477 | 0.5572 |
| MP big (10k) | 0.4184 | 0.5101 | 0.6971 | 0.4799 | 0.6459 |
| MP big (20k) | 0.4181 | 0.5066 | 0.6816 | 0.4773 | 0.6444 |
Experimental Results – MyPersonality Big
19
| Statistic | Value | MIN | AVG | MAX |
|---|---|---|---|---|
| Total users | 24 | | | |
| Total tweets | 18,473 | | | |
| Tweets per user | | 9 | 769.7 | 2,252 |
| Avg words per tweet per user | | 5 | 6.8 | 8.8 |
• 26 participants
• 2 removed – not enough tweets
• Big Five Inventory (BFI, 44 items)
Experimental Results – Twitter Sample
20
Mean Squared Error:

| Dataset | OPE | CON | EXT | AGR | NEU |
|---|---|---|---|---|---|
| Twitter Sample (MP small) | 0.3812 | 0.3129 | 0.3002 | 0.1319 | 0.2673 |
| Twitter Sample (MP big) | 0.3178 | 0.3236 | 0.4110 | 0.1362 | 0.2803 |
| Literature* | 0.4761 | 0.5776 | 0.7744 | 0.6241 | 0.7225 |
GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=username&count=200
statuses/user_timeline
• Returns: up to 200 tweets of username
• Format: json
• Rate limit: 1500/15 minutes
*Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J. Our Twitter Profiles, Our Selves: Predicting Personality with Twitter. pp. 180-185. doi:10.1109/PASSAT/SocialCom.2011.26.
Experimental Results – Twitter Sample
21
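The request shown on the slide above can be assembled programmatically. The sketch below only builds the URL for the v1.1 `statuses/user_timeline` endpoint quoted in the slide and sends nothing; the function name `timeline_url` is hypothetical, and a real call would also need authentication, which is omitted here.

```python
from urllib.parse import urlencode

# Endpoint as quoted on the slide.
BASE_URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"


def timeline_url(screen_name, count=200, max_id=None):
    """Build the GET URL for one page of a user's timeline."""
    params = {"screen_name": screen_name, "count": count}
    if max_id is not None:
        # max_id restricts results to tweets with an ID at or below this value,
        # which is how older pages are requested.
        params["max_id"] = max_id
    return f"{BASE_URL}?{urlencode(params)}"


url = timeline_url("username")
```

Since one request returns at most 200 tweets, collecting a full timeline means repeating the request with a decreasing `max_id` while staying within the rate limit.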
• Train word embeddings on textual data from social media
• Use a CNN for text transformation and prediction
• Expand the feature vector with additional semantic features
• Train multilingual word embeddings
• Test the approach on a bigger dataset
• Expand the Twitter sample
Future Work
22
~ Thank You ~
Appendix – Word2vec
Word Embedding 1: $[x_{1,1}, x_{2,1}, \ldots, x_{300,1}]$
Word Embedding 2: $[x_{1,2}, x_{2,2}, \ldots, x_{300,2}]$
...
Word Embedding N: $[x_{1,N}, x_{2,N}, \ldots, x_{300,N}]$
Result: $[m_1, m_2, \ldots, m_{300}]$, where $m_1 = \max_i x_{1,i}, \quad i = 1, 2, \ldots, N$
Appendix – Text Transformation (MAX)
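The speaker notes describe the full transformation as the component-wise max, min and average of the word vectors, concatenated into a 900-dimensional feature vector. A minimal NumPy sketch of that idea follows; the function name `status_update_vector` and the random stand-in embeddings are assumptions for illustration.

```python
import numpy as np


def status_update_vector(word_vectors):
    """Concatenate the component-wise max, min and mean of the word vectors."""
    matrix = np.asarray(word_vectors, dtype=float)   # shape (N, n)
    return np.concatenate(
        [matrix.max(axis=0), matrix.min(axis=0), matrix.mean(axis=0)]
    )


rng = np.random.default_rng(0)
words = rng.normal(size=(4, 300))    # 4 hypothetical word embeddings, n = 300
features = status_update_vector(words)

# Tiny hand-checkable case with n = 2 and N = 2.
small = status_update_vector([[1, 5], [3, 2]])
```

For the hand-checkable case the result is max `[3, 5]`, min `[1, 2]` and mean `[2, 3.5]` placed one after another, so the output length is always three times the embedding size.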
i7-6700HQ @ 2.60 GHz
16GB RAM
Python
• Scikit-learn
• Numpy
• Pandas
• tweepy
Appendix – Technical Setup



Editor's Notes

  1. In this thesis we analyze whether it is possible to automatically compute the personality of an individual by relying only on the language he/she uses in social networks.
  2. It is the most widely accepted model of personality, and it defines five traits. Low and high scores for the same trait indicate opposite tendencies.
  3. Users share a great amount of digital content, and personality has been successfully predicted from many different kinds of input data.
  4. As in the recent case of CA.
  5. The size of the vectors is usually set to 300.
  6. There is also a spatial relationship between them. Word embeddings also exhibit geometrical properties.
  7. This is possible thanks to the similarity between status updates and tweets, which is further increased by additional tweet processing steps.
  8. Each status update is labeled with the personality scores of the user who wrote it. We first test our approach on myPersonality small and then extend the analysis to myPersonality big. For this reason we expect to lose some predictive power.
  9. Before transforming the status updates, we preprocess them to remove noise and less-informative data.
  10. Preprocessing segments a status update into a list of words. We then compute, for each of the 300 vector components, the max, min and average among all the word vectors of the status update, and concatenate them into a feature vector of size 900 that we use to train the models.
  11. This is a supervised learning approach because for each input vector we also specify the output value, that is, the personality trait score.
  12. In the optimization phase we train different models on the training set of myPersonality and estimate their performance with MSE. We implement 10-fold cross-validation on the training set to test the models on the whole dataset. We test 19 different combinations of the values reported in the table and observe that... So we use SVM to train the 5 predictive models.
  13. To test the model on Twitter, we query the Twitter API to download all the tweets of a given user and clean them. We compute the personality score of a user by averaging the scores of all of their tweets.
  14. We report the SVM configurations that performed best in the optimization phase. Mean error and mean standard deviation over the five traits.
  15. We then extend the analysis to the whole myPersonality dataset of 16M status updates by using the same algorithms and configurations used for myPersonality small. We compare the results of the two datasets on the same task.
  16. To test the models on Twitter, we devise an experiment involving 26 participants who answered a personality questionnaire and agreed to take part in it. They share the same social and working environment.
  17. We use tweets to compute the personality of the participants and compare it with the questionnaire results. We compare our results with those obtained by a study in the literature. This is probably because the Twitter user sample has very similar personality characteristics and is not diverse.