Text Mining Online Social Networks for Personality Classification
Farzad Golnoori (1), Mohammad Karim Sohraby (2), and Farzin Yaghmaei (3)
(1) Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Semnan, Iran, farzadgolnoori@yahoo.com
(2) Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Semnan, Iran, Amir_sohraby@yahoo.com
(3) Department of Electrical and Computer Engineering, Semnan University, Semnan, Iran, f_yaghmaee@semnan.ac.ir
Abstract: Online social networks are among the most widely used applications on the Internet. Users of
these networks express their tastes, interests, and feelings every day, and the texts they share can be a rich
source for investigating their behaviour and personality traits. In this study we investigated text, one of the
diverse kinds of data available in online social networks, in order to classify users' personality traits. For this
purpose, we built a corpus of about 9900 status updates from 250 Facebook users, derived different datasets
from it according to the features extracted from the text, and finally used an RBF neural network to classify
users' personality traits. The results show that, for personality trait classification, the RBF neural network
achieves higher precision than common classifiers such as SVM and Naïve Bayes.
Keywords: Online social network, Text mining, Personality, User modeling.
1. Introduction
With the growing use of the Internet and web applications,
obtaining information about users' behavior on the web has
become important in applications where user modeling plays a
critical role, such as recommender systems, personalized
systems, and targeted marketing. Depending on the
application, a user model can be based on personal
information such as the user's name and age, skills,
knowledge, plans and goals, preferences and dislikes, or
information about the user's behavior and personality [1].
Among these, the user's personality is one of the most
interesting features.
The personality of an individual can be defined as a set of
features that induces a tendency in the behavior of the
individual. This tendency is stable across time and
situations. Knowing the personality of a given person provides
hints about how he or she would probably react when facing
different situations [2]. Research in the psychology literature
has led to a well-established model for personality recognition
and description, called the Big Five Personality Model [3]. The
five traits can be summarized in the following way:
• Extraversion measures a tendency to seek stimulation in
the external world, the company of others, and to express
positive emotions. Extroverts tend to be more outgoing,
friendly, and socially active. They are usually energetic and
talkative; they do not mind being at the center of attention,
and make new friends more easily. Introverts are more
likely to be solitary or reserved and seek environments
characterized by lower levels of external stimulation.
• Conscientiousness measures preference for an organized
approach to life in contrast to a spontaneous one.
Conscientious people are more likely to be well organized,
reliable, and consistent. They enjoy planning, seek
achievements, and pursue long-term goals.
Non-conscientious individuals are generally more easy-going,
spontaneous, and creative. They tend to be more tolerant
and less bound by rules and plans.
• Openness to experience (Openness) is related to
imagination, creativity, curiosity, tolerance, political
liberalism, and appreciation for culture. People scoring high
on Openness like change, appreciate new and unusual
ideas, and have a good sense of aesthetics.
• Agreeableness relates to a focus on maintaining positive
social relations, being friendly, compassionate, and
cooperative. Agreeable people tend to trust others and adapt
to their needs. Disagreeable people are more focused on
themselves, less likely to compromise, and may be less
gullible. They also tend to be less bound by social
expectations and conventions, and more assertive.
• Neuroticism (reversely referred to as Emotional Stability)
measures the tendency to experience mood swings and
emotions such as guilt, anger, anxiety, and depression.
Emotionally unstable (neurotic) people are more likely to
experience stress and nervousness, while emotionally stable
people (low Neuroticism) tend to be calmer and
self-confident.
The most commonly used procedure to obtain this
information consists of asking the user to fill in
questionnaires. However, users can find this task too time-
consuming, since most of the personality questionnaires
include many questions to answer in order to obtain an
accurate user profile [2],[3].
Today's social networks, such as Facebook, are rich sources
of text in different forms. Facebook users can update their
status, post comments on a friend's wall, or comment on
other users' posts. One of the most popular Facebook features
is the user status, which can be regarded as a small blog for
describing a person's views, feelings, beliefs, and behavior.
User statuses therefore potentially contain information about
a person's personality [4].
However, on social networking websites people generally
use unstructured or semi-structured language for
communication. As in everyday conversation, they do not
care about spelling or accurate grammatical construction,
which may lead to different types of ambiguity: lexical,
syntactic, and semantic [6]. Extracting logical patterns with
accurate information from such unstructured text is therefore
a critical task, and text mining can be a solution to these
problems.
Text mining refers to the analysis of textual data by machine
learning techniques, intelligent information retrieval, natural
language processing, or other related methods in order to
extract and discover knowledge from text [6]. Moreover, since
Facebook and other social networks have in recent years
introduced many rules to protect users' privacy, text is one of
the more accessible sources compared with other data used in
online social networks. The main purpose of this study is to
use the text available in social networks and to investigate the
power of the features that can be extracted from it, without
using other kinds of information about the user, such as
information on the user's use of the social network (the
number of statuses, the number of joined groups, the number
of Likes) or structural information about the user's egocentric
network (the number of friends, or measures such as
betweenness and density), in order to classify personality
traits. The main question is whether useful information about
a user's personality can be obtained from a sample of the
user's Facebook statuses or tweets on Twitter.
2. Related Works
In recent years there have been many attempts to
automatically classify personality traits from text or from
other cues, such as social network usage. Oberlander and
Nowson [8] classified the extraversion, stability,
agreeableness, and conscientiousness of blog authors using
n-grams as features and Naive Bayes (NB) as the learning
algorithm. They reported that binary classes and automatic
feature selection yield the best improvement over the
baseline. Mairesse et al. [9] ran personality recognition on
both conversation (using observer judgements) and text (using
self-assessments via Big5). They exploited two lexical
resources, LIWC and MRC, as features, and predicted both
personality scores and classes using Support Vector Machines
(SVMs) and M5 trees respectively. They also reported a long
list of correlations between the Big5 personality traits and the
two lexical resources they used. Iacobelli et al. [10] used as
features word n-grams extracted from a large corpus of blogs,
testing different extraction settings, such as the presence or
absence of stop words or inverse document frequency. They
found that bigrams, treated as boolean features and keeping
stop words, yield very good results with SVMs as the learning
algorithm, although the features extracted are few in a very
large corpus.
As for personality recognition from social network sites,
Ortigosa et al. [2] used parameters describing Facebook
users' activity, such as the number of friends, the number of
posts in the last month, and the number of months since the
user began using Facebook, together with a decision tree
algorithm, to build personality trait classifiers in two
settings, 3-class and 5-class. The 3-class setting (low, middle,
high) was reported to be the more precise, with 70% accuracy
for all personality traits. Since many studies have shown a
correlation between textual data and personality traits, some
works treat the text in users' profiles, posts, and tweets as a
tool for personality extraction, alongside other data available
in social networks such as structural, personal, and behavioral
information [11], users' interests and preferences (the matrix
of users' Likes) [12], cultural information, and information
about the place where the person lives (such as ethnicity
distribution, average house price, and average income) [13].
Most of these works used text analysis tools such as
LIWC [14] to extract the desired features. This software
measures the usage of predefined categories of words across
the whole text. To the best of our knowledge, the present
study is the first to use text alone, with text mining as a
powerful tool, for predicting users' personality traits in online
social networks. As in other text classification tasks, our
approach has two major stages: first, the most predictive
features are extracted from the text, and then the documents
(each user's set of statuses) are classified with a machine
learning algorithm.
3. Methodology
One of the major applications of text mining is text
classification, which assigns a document to a predefined
category of documents.
In particular, given a set of labeled documents
D={d1,d2,...,dn} belonging to the set of categories
C={c1, c2,..., cp}, the task of text classification is to train
classifiers on these documents and to assign new (unseen)
documents to the specified categories [15]. In this work, we
used about 9900 status updates from 250 Facebook users,
collected by the myPersonality project [12], to evaluate our
methods. We merged all the statuses posted by each user into
a single text, so that for every user in the data set we have one
text containing all of that user's statuses. The classification
task is then to assign each text (a user's statuses) to the low or
high category for each personality trait. Our approach consists
of the following main steps.
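The merging step described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `(user_id, status)` row layout, the function name `merge_statuses`, and the sample rows are hypothetical and do not come from the corpus used in the study.

```python
from collections import defaultdict

def merge_statuses(rows):
    """Join every status of a user into one document, keeping posting order."""
    per_user = defaultdict(list)
    for user_id, status in rows:
        per_user[user_id].append(status.strip())
    return {user: " ".join(texts) for user, texts in per_user.items()}

# Hypothetical rows; the real corpus has about 9900 statuses from 250 users.
rows = [
    ("u1", "Loving the new job!"),
    ("u2", "Rainy day, staying in."),
    ("u1", "Party tonight with friends"),
]
documents = merge_statuses(rows)
print(documents["u1"])
```

Each resulting document is then the unit that receives a low or high label per trait.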
3.1 Preprocessing
The preprocessing phase prepares the statuses for
classification: labels and stop words are removed, and the
remaining text of each document is then stemmed.
Stop words are words that do not add meaningful
content to the data set (e.g., pronouns, prepositions,
conjunctions). Removing them significantly reduces the
space of items in the training and testing texts and simplifies
the targeted analysis. Stemming is the process of removing
prefixes and suffixes, leaving the stem or the root of the
considered words.
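A minimal sketch of stop-word removal followed by stemming is given below. The stop list is a tiny illustrative sample and `naive_stem` is a crude suffix stripper written for this example; it is an assumption standing in for a real stemmer and is not the Snowball stemmer mentioned in Section 4.

```python
import re

# Tiny illustrative stop list (assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "my", "of", "the", "to", "with"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer

def naive_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Walking to the parks with my friends"))
# ['walk', 'park', 'friend']
```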
3.2 Feature Extraction
Textual documents must be represented in a form that a
classifier can interpret. The two main approaches to text
representation are the Bag-of-Words model (BOW) and the
Vector Space Model (VSM) [15]. In the BOW model, each word
is represented as a separate variable with a numeric weight.
VSM is now widely regarded as the best text representation
model. Its basic idea is to represent the document as a vector
in which each feature term is a weighted component. A term's
weight can be binary or decimal. In the binary case, 0
indicates the absence of the term and 1 indicates its presence
in the given document.
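As an illustration of binary weighting, the sketch below builds a vocabulary from already tokenized documents and maps each document to a 0/1 presence vector. The helper names and the toy documents are assumptions made for the example.

```python
def build_vocabulary(documents):
    """Sorted list of every distinct term across tokenized documents."""
    return sorted({term for doc in documents for term in doc})

def binary_vector(doc, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

docs = [["happy", "party", "friend"], ["rain", "home", "happy"]]
vocab = build_vocabulary(docs)
print(vocab)                                   # ['friend', 'happy', 'home', 'party', 'rain']
print([binary_vector(d, vocab) for d in docs])  # [[1, 1, 0, 1, 0], [0, 1, 1, 0, 1]]
```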
When the weights are non-binary, they are calculated with
statistical and probabilistic techniques. One of the most
popular term weighting functions is tf*idf [14]. This method
weighs the frequency of a word in one document against its
frequency across the whole document set. One of the main
steps of feature extraction is n-gram conversion [16], which
consists of extracting a bag-of-words representation of the
text field. In this work we used unigrams, bigrams, and
trigrams as features.
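The two ideas in this paragraph, word n-grams and tf*idf weighting, can be sketched as follows. Several tf*idf variants exist; the one used here (raw term frequency times the natural logarithm of N over document frequency) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import math

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc, corpus):
    """Raw term frequency times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

tokens = ["so", "happy", "today"]
print(ngrams(tokens, 1))  # ['so', 'happy', 'today']
print(ngrams(tokens, 2))  # ['so happy', 'happy today']
print(ngrams(tokens, 3))  # ['so happy today']

corpus = [["happy", "happy", "day"], ["sad", "day"]]
print(tf_idf("happy", corpus[0], corpus))  # frequent here, rare elsewhere
print(tf_idf("day", corpus[0], corpus))    # occurs in every document, so 0.0
```

A term that appears in every document gets weight zero, which is the sense in which the method sets a word's local frequency against its frequency in the whole set.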
3.3 Feature Selection
The next step is selecting a suitable feature space from among
the terms in the documents. This stage is vital, because the
precision of the system depends heavily on the selected terms
that represent each document. We used a filter-based feature
selection method with the Information Gain ranking criterion
to select the features with the highest predictive power.
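For binary features and class labels, information gain is the class entropy minus the entropy that remains after splitting the documents on the feature. The sketch below ranks two toy features this way; the feature names and values are hypothetical, and the code is a from-scratch illustration rather than the Weka attribute evaluator.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

def information_gain(feature, labels):
    """H(labels) minus the entropy left after splitting on a binary feature."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = ["high", "high", "low", "low"]
features = {
    "party": [1, 1, 0, 0],  # perfectly separates the classes
    "today": [1, 0, 1, 0],  # carries no class information
}
ranked = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
print(ranked)  # ['party', 'today']
```

A filter method keeps the top-ranked features and discards the rest before any classifier is trained.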
3.4 Classification
In this study we used an RBF neural network to classify
users' statuses. A Radial Basis Function (RBF) network is a
type of artificial neural network used for supervised learning
problems [17]. Training an RBF network is relatively fast
because of its simple structure. RBF networks are also capable
of universal approximation under non-restrictive assumptions
[19]. They can be used in any type of model, linear or
non-linear, and in any kind of network, single-layer or
multilayer [18]. In general, to train a neural network the
documents are divided into training and test documents: the
training documents are used to train the system and the test
documents to evaluate it.
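To make the structure of an RBF network concrete, the sketch below uses one hidden layer of Gaussian units and a linear output layer trained with the delta (LMS) rule. It is a from-scratch illustration, not the Weka implementation used in the experiments; placing one center on each training point, the width of 0.5, the learning rate, and the XOR toy data are all assumptions made for the example.

```python
import math

def rbf_features(x, centers, width):
    """Gaussian activation of each hidden unit for input x."""
    return [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))
        for c in centers
    ]

def train_output_weights(inputs, targets, centers, width, epochs=200, rate=0.5):
    """Fit the linear output layer with the delta (LMS) rule."""
    weights = [0.0] * (len(centers) + 1)  # last entry is the bias
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            phi = rbf_features(x, centers, width) + [1.0]
            error = target - sum(w * p for w, p in zip(weights, phi))
            weights = [w + rate * error * p for w, p in zip(weights, phi)]
    return weights

def predict(x, centers, width, weights):
    """Class 1 if the network output exceeds 0.5, else class 0."""
    phi = rbf_features(x, centers, width) + [1.0]
    return 1 if sum(w * p for w, p in zip(weights, phi)) > 0.5 else 0

# XOR on binary vectors: not linearly separable in the input space.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 0]
centers = inputs  # assumption: one hidden unit per training point
width = 0.5
weights = train_output_weights(inputs, targets, centers, width)
predictions = [predict(x, centers, width, weights) for x in inputs]
print(predictions)
```

Only the output weights are learned here, which is one reason training such networks tends to be fast: once the hidden units are fixed, the remaining problem is linear.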
Because the data set is small, instead of dividing the data
into a training part and a test part, we used 10-fold
cross-validation to measure the effectiveness of the neural
network. In this method, the data set is divided into 10
subsets; each time, one subset is used for testing while the
rest of the data serves as training data. The reported precision
is the mean of the precision obtained in these 10 rounds.
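The fold bookkeeping can be sketched as below. The sketch splits indices in order, without shuffling or stratification, which is a simplifying assumption; `evaluate` is a hypothetical callback that would train on one index list and return a score for the other.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_items, k, evaluate):
    """Mean of evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)

# 250 users and 10 folds give 225 training and 25 test users per round.
folds = list(k_fold_indices(250, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 225 25

# Placeholder scorer that just returns the test-fold size.
print(cross_validate(250, 10, lambda train, test: len(test)))  # 25.0
```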
4. Experiments and Results
In this work, for each personality trait, three data sets were
built using unigrams, bigrams, and trigrams with binary
weighting. The whole procedure was performed in the Weka
toolkit [17] using the StringToWordVector filter. Stemming
was done with the Snowball stemmer and tokenization with
the NGramTokenizer. To build the RBF neural networks we
used the RBF implementation in the Weka toolkit. For each
personality trait, three RBF neural networks were built, one
per data set.
To evaluate the method, we compared the precision and
recall obtained by the RBF text classifier with those of other
popular text classifiers, SVM and Naïve Bayes. The two
criteria are defined as follows:
$$\text{Precision} = \frac{a}{a+b} \qquad (1)$$
$$\text{Recall} = \frac{a}{a+c} \qquad (2)$$
a = the number of texts correctly assigned to a category
b = the number of texts incorrectly assigned to a category
c = the number of texts incorrectly rejected from a category
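The two criteria translate directly into code. In the sketch below the counts are hypothetical and chosen only to show the arithmetic; the guards against a zero denominator are an added convention, not part of the definitions.

```python
def precision(a, b):
    """a / (a + b): share of texts assigned to the category that belong there."""
    return a / (a + b) if a + b else 0.0

def recall(a, c):
    """a / (a + c): share of the category's texts that were assigned to it."""
    return a / (a + c) if a + c else 0.0

# Hypothetical counts for the "high" category of one trait.
a, b, c = 90, 10, 30
print(precision(a, b), recall(a, c))  # 0.9 0.75
```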
According to the results in Tables I to V, for each of the
five personality traits the RBF classifier reached its highest
precision with the bigram model (b2), compared with the
unigram (b1) and trigram (b3) models. For every trait and
feature set, the RBF neural network also had higher precision
than the Naïve Bayes and SVM classifiers. The best precision
obtained was .945 for extraversion, .931 for neuroticism, .894
for agreeableness, .949 for conscientiousness, and .931 for
openness, so the hardest trait for the RBF-based classifier was
agreeableness and the easiest was conscientiousness. The
results also show that, for the RBF classifier, trigrams are
more effective than unigrams for openness and extraversion,
while unigrams are more effective than trigrams for
conscientiousness, agreeableness, and neuroticism. Comparing
the precision and recall of the SVM and Naïve Bayes
classifiers, Naïve Bayes is more effective than SVM for
openness (NaiveBayes-b3), agreeableness (NaiveBayes-b2),
neuroticism (NaiveBayes-b1), and extraversion
(NaiveBayes-b2), while for conscientiousness SVM (SVM-b2)
is more effective than Naïve Bayes.
TABLE I. CLASSIFICATION RESULTS FOR EXTRAVERSION
Classifier-Feature   Precision   Recall
RBF-b1               0.893       0.888
RBF-b2               0.945       0.94
RBF-b3               0.905       0.892
SVM-b1               0.804       0.8
SVM-b2               0.846       0.82
SVM-b3               0.826       0.772
NaiveBayes-b1        0.813       0.808
NaiveBayes-b2        0.854       0.844
NaiveBayes-b3        0.819       0.8
TABLE II. CLASSIFICATION RESULTS FOR NEUROTICISM
Classifier-Feature   Precision   Recall
RBF-b1               0.903       0.872
RBF-b2               0.931       0.916
RBF-b3               0.854       0.768
SVM-b1               0.764       0.748
SVM-b2               0.856       0.82
SVM-b3               0.814       0.732
NaiveBayes-b1        0.878       0.856
NaiveBayes-b2        0.863       0.832
NaiveBayes-b3        0.827       0.772
TABLE III. CLASSIFICATION RESULTS FOR AGREEABLENESS
Classifier-Feature   Precision   Recall
RBF-b1               0.88        0.852
RBF-b2               0.894       0.868
RBF-b3               0.859       0.808
SVM-b1               0.758       0.748
SVM-b2               0.847       0.796
SVM-b3               0.808       0.7
NaiveBayes-b1        0.813       0.784
NaiveBayes-b2        0.882       0.864
NaiveBayes-b3        0.728       0.728
TABLE IV. CLASSIFICATION RESULTS FOR CONSCIENTIOUSNESS
Classifier-Feature   Precision   Recall
RBF-b1               0.932       0.924
RBF-b2               0.949       0.944
RBF-b3               0.887       0.856
SVM-b1               0.802       0.78
SVM-b2               0.834       0.796
SVM-b3               0.816       0.748
NaiveBayes-b1        0.777       0.764
NaiveBayes-b2        0.805       0.784
NaiveBayes-b3        0.821       0.776
TABLE V. CLASSIFICATION RESULTS FOR OPENNESS
Classifier-Feature   Precision   Recall
RBF-b1               0.912       0.9
RBF-b2               0.931       0.924
RBF-b3               0.925       0.916
SVM-b1               0.834       0.816
SVM-b2               0.832       0.796
SVM-b3               0.837       0.788
NaiveBayes-b1        0.85        0.772
NaiveBayes-b2        0.861       0.836
NaiveBayes-b3        0.875       0.848
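The comparison of n-gram models can be reproduced from the tables with a few lines of Python. The values below are the RBF precision figures copied from Tables I to V; the dictionary layout and variable names are choices made for this sketch.

```python
# Precision of the RBF classifier per trait, copied from Tables I-V.
rbf_precision = {
    "Extraversion":      {"b1": 0.893, "b2": 0.945, "b3": 0.905},
    "Neuroticism":       {"b1": 0.903, "b2": 0.931, "b3": 0.854},
    "Agreeableness":     {"b1": 0.880, "b2": 0.894, "b3": 0.859},
    "Conscientiousness": {"b1": 0.932, "b2": 0.949, "b3": 0.887},
    "Openness":          {"b1": 0.912, "b2": 0.931, "b3": 0.925},
}

# Best n-gram model per trait and the precision it reached.
best_model = {trait: max(scores, key=scores.get) for trait, scores in rbf_precision.items()}
best_score = {trait: rbf_precision[trait][best_model[trait]] for trait in rbf_precision}

print(best_model)                           # every trait maps to 'b2'
print(min(best_score, key=best_score.get))  # Agreeableness
print(max(best_score, key=best_score.get))  # Conscientiousness
```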
5. Conclusion
In this study we explored the text available in social
networks, together with text mining methods, as a tool for
classifying the personality traits of Facebook users. Our main
purpose was to investigate the relationship between particular
words and the personality traits of social network users. For
each personality trait we built three data sets according to the
terms used to represent the text (unigrams, bigrams,
trigrams) with binary term weighting.
Using these data sets, we trained several RBF neural
networks to classify the personality of social network users
and evaluated their effectiveness with 10-fold
cross-validation. The results show that the bigram model
performs better on Facebook user statuses than the trigram
and unigram models. The RBF neural network also achieved
higher precision than the other classifiers for personality trait
classification. The high precision obtained for all five
personality traits suggests that a sample of a user's Facebook
statuses can reveal information about the person's personality
traits and, consequently, help predict his or her behavior in
specific situations. The main advantage of this approach is
that it does not require other user data, such as the number of
Likes, the number of joined groups, or structural data about
the user's friend network. The text in user statuses, combined
with text classification as a text mining application, thus
becomes a powerful tool for classifying users' personality
traits. The results of this research can be useful in fields such
as assistive technology, e-learning, e-business, health care
systems, and recommender systems.
References
[1] A. Kobsa, “Generic user modeling systems,” User Modeling and
User-Adapted Interaction, Vol. 11, No. 1-2, pp. 49-63, 2001.
[2] A. Ortigosa, R. M. Carro and J. I. Quiroga, “Predicting user personality
by mining social interactions in Facebook,” Journal of Computer and
System Sciences, Vol. 80, pp. 57-71, 2014.
[3] L. R. Goldberg and T. K. Rosolack, The Developing Structure of
Temperament and Personality from Infancy to Adulthood, chapter “The
Big Five Factor Structure as an Integrative Framework: An Empirical
Comparison with Eysenck's P-E-N Model,” Erlbaum, New York, 1994.
[4] A. Aluja, J. Rossier, L. F. García, A. Angleitner, M. Kuhlman and M.
Zuckerman, “A cross-cultural shortened form of the ZKPQ (ZKPQ-50-cc)
adapted to English, French, German, and Spanish languages,”
Personality and Individual Differences, Vol. 41, No. 4, pp. 619-628, 2006.
[5] M. D. Back, J. M. Stopfer, S. Vazire, S. Gaddis, S. C. Schmukle, B.
Egloff and S. D. Gosling, “Facebook profiles reflect actual personality,
not self-idealization,” Psychological Science, 2010.
[6] L. Sorensen, “User managed trust in social networking - Comparing
Facebook, MySpace and Linkedin,” in Wireless Communication,
Vehicular Technology, Information Theory and Aerospace & Electronic
Systems Technology, 2009. Wireless VITAE 2009. 1st International
Conference on, IEEE, pp. 427-431, 2009.
[7] F. Liu and L. Xiong, “Survey on text clustering algorithm,” in
Software Engineering and Service Science (ICSESS), 2011 IEEE 2nd
International Conference on, IEEE, pp. 901-904, 2011.
[8] J. Oberlander and S. Nowson, “Whose thumb is it anyway?
Classifying author personality from weblog text,” in Proceedings of the
COLING/ACL on Main conference poster sessions, Association for
Computational Linguistics, pp. 627-634, 2006.
[9] F. Mairesse, M. A. Walker, M. R. Mehl and R. K. Moore, “Using
Linguistic Cues for the Automatic Recognition of Personality in
Conversation and Text,” J. Artif. Intell. Res. (JAIR), Vol. 30, pp. 457-500,
2007.
[10] F. Iacobelli, A. J. Gill, S. Nowson and J. Oberlander, “Large scale
personality classification of bloggers,” in Affective Computing and
Intelligent Interaction, Springer Berlin Heidelberg, pp. 568-577, 2011.
[11] J. Golbeck, C. Robles and K. Turner, “Predicting Personality with
Social Media,” in CHI'11 Extended Abstracts on Human Factors in
Computing Systems, ACM, pp. 253-262, 2011.
[12] M. Kosinski, D. Stillwell and T. Graepel, “Private traits and attributes are
predictable from digital records of human behavior,” Proceedings of
the National Academy of Sciences, Vol. 110, No. 15, pp. 5802-5805,
2013.
[13] D. Chapsky, “Leveraging Online Social Networks and External Data
Sources to Predict Personality,” in Advances in Social Networks
Analysis and Mining (ASONAM), 2011 International Conference on,
IEEE, pp. 428-433, 2011.
[14] J. W. Pennebaker, M. E. Francis and R. J. Booth, Linguistic Inquiry and
Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71,
2001.
[15] F. Sebastiani, “Machine learning in automated text
categorization,” ACM Computing Surveys (CSUR), Vol. 34, No. 1,
pp. 1-47, 2002.
[16] B. Carpenter, “Scaling High-Order Character Language Models to
Gigabytes,” in Proceedings of the 2005 Association for Computational
Linguistics Software Workshop, pp. 1-14, 2005.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques. Morgan Kaufmann, 2005.
[18] M. J. Orr, Introduction to Radial Basis Function Networks, 1996.
[19] J. Park and I. W. Sandberg, “Approximation and Radial-Basis-Function
Networks,” Neural Computation, Vol. 5, No. 2, pp. 305-316, 1993.

More Related Content

Similar to 209

Survey on personality predication methods using AI
Survey on personality predication methods using AISurvey on personality predication methods using AI
Survey on personality predication methods using AI
IJAEMSJORNAL
 
User Personality Prediction on Facebook Social Media using Machine Learning
User Personality Prediction on Facebook Social Media using Machine LearningUser Personality Prediction on Facebook Social Media using Machine Learning
User Personality Prediction on Facebook Social Media using Machine Learning
ijtsrd
 
IRJET- Personality Recognition using Social Media Data
IRJET- Personality Recognition using Social Media DataIRJET- Personality Recognition using Social Media Data
IRJET- Personality Recognition using Social Media Data
IRJET Journal
 
A Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media DataA Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media Data
IOSR Journals
 
O017148084
O017148084O017148084
O017148084
IOSR Journals
 
Ijcatr04061001
Ijcatr04061001Ijcatr04061001
Ijcatr04061001
Editor IJCATR
 
NLP journal paper
NLP journal paperNLP journal paper
NLP journal paper
Imranul Kabir Chowdhury
 
Essay Writing Skill
Essay Writing SkillEssay Writing Skill
Essay Writing Skill
Kristen Lee
 
A Guide to Social Network Analysis
A Guide to Social Network AnalysisA Guide to Social Network Analysis
A Guide to Social Network Analysis
Olivier Serrat
 
Approach for Enneagram personality detection for Twitter text: a case study
Approach for Enneagram personality detection for Twitter text: a case studyApproach for Enneagram personality detection for Twitter text: a case study
Approach for Enneagram personality detection for Twitter text: a case study
IJECEIAES
 
Problem statement-1-friend-affinity-finder
Problem statement-1-friend-affinity-finderProblem statement-1-friend-affinity-finder
Problem statement-1-friend-affinity-finder
AmitabhDas22
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
healdkathaleen
 
Review on Opinion Mining for Fully Fledged System
Review on Opinion Mining for Fully Fledged SystemReview on Opinion Mining for Fully Fledged System
Review on Opinion Mining for Fully Fledged System
ijeei-iaes
 
Current trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networksCurrent trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networks
eSAT Publishing House
 
customer behavior analysis for social media
customer behavior analysis for social mediacustomer behavior analysis for social media
customer behavior analysis for social media
INFOGAIN PUBLICATION
 
Big five personality prediction based in Indonesian tweets using machine lea...
Big five personality prediction based in Indonesian tweets using  machine lea...Big five personality prediction based in Indonesian tweets using  machine lea...
Big five personality prediction based in Indonesian tweets using machine lea...
IJECEIAES
 
Social Networking Facebook My Space
Social Networking Facebook My SpaceSocial Networking Facebook My Space
Social Networking Facebook My Space
annesunita
 
a modified weight balanced algorithm for influential users community detectio...
a modified weight balanced algorithm for influential users community detectio...a modified weight balanced algorithm for influential users community detectio...
a modified weight balanced algorithm for influential users community detectio...
INFOGAIN PUBLICATION
 
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIATHE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
IJCSES Journal
 
SENTIMENT ANALYSIS-AN OBJECTIVE VIEW
SENTIMENT ANALYSIS-AN OBJECTIVE VIEWSENTIMENT ANALYSIS-AN OBJECTIVE VIEW
SENTIMENT ANALYSIS-AN OBJECTIVE VIEW
Journal For Research
 

Similar to 209 (20)

Survey on personality predication methods using AI
Survey on personality predication methods using AISurvey on personality predication methods using AI
Survey on personality predication methods using AI
 
User Personality Prediction on Facebook Social Media using Machine Learning
User Personality Prediction on Facebook Social Media using Machine LearningUser Personality Prediction on Facebook Social Media using Machine Learning
User Personality Prediction on Facebook Social Media using Machine Learning
 
IRJET- Personality Recognition using Social Media Data
IRJET- Personality Recognition using Social Media DataIRJET- Personality Recognition using Social Media Data
IRJET- Personality Recognition using Social Media Data
 
A Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media DataA Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media Data
 
O017148084
O017148084O017148084
O017148084
 
Ijcatr04061001
Ijcatr04061001Ijcatr04061001
Ijcatr04061001
 
NLP journal paper
NLP journal paperNLP journal paper
NLP journal paper
 
Essay Writing Skill
Essay Writing SkillEssay Writing Skill
Essay Writing Skill
 
A Guide to Social Network Analysis
A Guide to Social Network AnalysisA Guide to Social Network Analysis
A Guide to Social Network Analysis
 
Approach for Enneagram personality detection for Twitter text: a case study
Approach for Enneagram personality detection for Twitter text: a case studyApproach for Enneagram personality detection for Twitter text: a case study
Approach for Enneagram personality detection for Twitter text: a case study
 
Problem statement-1-friend-affinity-finder
Problem statement-1-friend-affinity-finderProblem statement-1-friend-affinity-finder
Problem statement-1-friend-affinity-finder
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
 
Review on Opinion Mining for Fully Fledged System
Review on Opinion Mining for Fully Fledged SystemReview on Opinion Mining for Fully Fledged System
Review on Opinion Mining for Fully Fledged System
 
Current trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networksCurrent trends of opinion mining and sentiment analysis in social networks
Current trends of opinion mining and sentiment analysis in social networks
 
customer behavior analysis for social media
customer behavior analysis for social mediacustomer behavior analysis for social media
customer behavior analysis for social media
 
Big five personality prediction based in Indonesian tweets using machine lea...
Big five personality prediction based in Indonesian tweets using  machine lea...Big five personality prediction based in Indonesian tweets using  machine lea...
Big five personality prediction based in Indonesian tweets using machine lea...
 
Social Networking Facebook My Space
Social Networking Facebook My SpaceSocial Networking Facebook My Space
Social Networking Facebook My Space
 
a modified weight balanced algorithm for influential users community detectio...
a modified weight balanced algorithm for influential users community detectio...a modified weight balanced algorithm for influential users community detectio...
a modified weight balanced algorithm for influential users community detectio...
 
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIATHE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
THE SURVEY OF SENTIMENT AND OPINION MINING FOR BEHAVIOR ANALYSIS OF SOCIAL MEDIA
 
SENTIMENT ANALYSIS-AN OBJECTIVE VIEW
SENTIMENT ANALYSIS-AN OBJECTIVE VIEWSENTIMENT ANALYSIS-AN OBJECTIVE VIEW
SENTIMENT ANALYSIS-AN OBJECTIVE VIEW
 

209

  • 1. 1 Text mining Online Social Networks for Personality Classification Farzad Golnoori1 , Mohammad Karim Sohraby2 , and Farzin Yaghmaei3 1 Department of Computer Engineering Science and Research Branch, Islamic Azad university Semnan,Iran , farzadgolnoori@yahoo.com 2 Department of Computer Engineering Science and Research Branch, Islamic Azad university Semnan,Iran , Amir_sohraby@yahoo.com 3 Department of Electrical and Computer Engineering Semnan University, Semnan,Iran , f_yaghmaee@semnan.ac.ir Abstract: Today's online social networks are one the major application programs among internet users. The user of these networks daily expresses their tastes, interest and feelings in these networks. Among these, shared texts by users can be important and rich sources for investigating current user's behaviour and personality traits in these networks. In this study we investigated existing text in social networks, one of the existing diverse data in online social network in order to classify user's personality traits. For this purpose, we built great corpus about 9900 status update, related to 250 existing user in face book social network, then with using this source, different datasets built according to extractive traits from text and finally we used RBF neural network for classify user's personality traits .The results show that for personality traits classification ,RBF neural network have high precision than common classification such as SVM, Naïve Bayes. Keywords: Online social network, Text mining, Personality, User modeling. 1. Introduction Today with growing use of Internet and web application program, achieving information about user's behavior in web, in applications which the user's modeling play's critical role like recommender systems, personalized systems, or in applications like targeted marketing, have significant importance. 
Depending on the application, a user model can be based on personal information such as the user's name and age, on skills, knowledge, plans and goals, preferences, and dislikes, or on information about the user's behavior and personality [1]. In this area, the user's personality is one of the most interesting features. The personality of an individual can be defined as a set of features that induces a tendency in the individual's behavior; this tendency is stable across time and situations. Knowing the personality of a given person provides hints about how he or she would probably react when facing different situations [2]. Research in the psychology literature has led to a well-established model for personality recognition and description, called the Big Five Personality Model [3]. The five traits can be summarized as follows:

- Extraversion measures a tendency to seek stimulation in the external world and the company of others, and to express positive emotions. Extroverts tend to be more outgoing, friendly, and socially active. They are usually energetic and talkative; they do not mind being at the center of attention, and they make new friends more easily. Introverts are more likely to be solitary or reserved and seek environments characterized by lower levels of external stimulation.
- Conscientiousness measures preference for an organized approach to life in contrast to a spontaneous one. Conscientious people are more likely to be well organized, reliable, and consistent. They enjoy planning, seek achievements, and pursue long-term goals. Non-conscientious individuals are generally more easy-going, spontaneous, and creative. They tend to be more tolerant and less bound by rules and plans.
- Openness to experience (Openness) is related to imagination, creativity, curiosity, tolerance, political liberalism, and appreciation for culture. People scoring high on Openness like change, appreciate new and unusual ideas, and have a good sense of aesthetics.
- Agreeableness relates to a focus on maintaining positive social relations and on being friendly, compassionate, and cooperative. Agreeable people tend to trust others and adapt to their needs. Disagreeable people are more focused on themselves, less likely to compromise, and may be less gullible. They also tend to be less bound by social expectations and conventions, and more assertive.
- Neuroticism (reversely referred to as Emotional Stability) measures the tendency to experience mood swings and emotions such as guilt, anger, anxiety, and depression. Emotionally unstable (neurotic) people are more likely to experience stress and nervousness, while emotionally stable people (low Neuroticism) tend to be calmer and more self-confident.

The most commonly used procedure to obtain this information is to ask the user to fill in questionnaires. However, users can find this task too time-consuming, since most personality questionnaires include many questions that must be answered in order to obtain an accurate user profile [2],[3]. Today's social networks, such as Facebook, are rich sources of text in different forms. Facebook users can update their status, comment on a friend's wall, or comment
  • 2. on another user's post. One of the most popular Facebook features is the user status, which can be regarded as a small blog for describing a person's views, feelings, beliefs, and behavior. User statuses therefore potentially contain information about a person's personality [4]. However, on social networking websites people generally communicate in unstructured or semi-structured language. In everyday conversation, people do not care about spelling or the accurate grammatical construction of a sentence, which may lead to different types of ambiguity: lexical, syntactic, and semantic [6]. Extracting logical patterns with accurate information from such unstructured text is therefore a critical task, and text mining can be a solution to these problems. Text mining refers to the analysis of textual data by machine learning techniques, intelligent information retrieval, natural language processing, or other related methods in order to extract and discover knowledge from text [6]. In addition, since Facebook and other social networks have in recent years introduced many rules to protect users' privacy, text can be one of the more accessible sources compared with other data used in online social networks.

The main purpose of this study is to use the text available in social networks and to investigate the power of the features that can be extracted from it for classifying personality traits, without using any other kind of information about the user. Such other information includes data about the user's use of the social network (the number of statuses, the number of joined groups, the number of likes) and structural information about the user's egocentric network, such as the number of friends or measures such as betweenness and density. The main question is whether a sample of a user's Facebook statuses or Twitter tweets can yield useful information about the user's personality.

2. Related Works
In recent years there have been many attempts to automatically classify personality traits from text or from other cues, such as social network usage. The authors of [8] classified the extraversion, stability, agreeableness, and conscientiousness of blog authors using n-grams as features and Naive Bayes (NB) as the learning algorithm. They reported that binary classes and automatic feature selection yield the best improvement over the baseline. The authors of [9] ran personality recognition on both conversation (using observer judgements) and text (using self-assessments via Big5). They exploited two lexical resources as features, LIWC and MRC, and predicted personality scores and classes using Support Vector Machines (SVMs) and M5 trees respectively. They also reported a long list of correlations between the Big5 personality traits and the two lexical resources they used. The authors of [10] used as features word n-grams extracted from a large corpus of blogs, testing different extraction settings, such as the presence or absence of stop words and inverse document frequency. They found that bigrams, treated as boolean features and keeping stop words, yield very good results with SVMs as the learning algorithm, although the features extracted are few in a very large corpus.

As for personality recognition from social network sites, [2] built personality trait classifiers with a decision tree algorithm, using parameters related to Facebook users' activity, such as the number of friends, the number of posts in the last month, and the number of months since the user became active on Facebook. Classifiers were built for a 3-class and a 5-class case; the 3-class case (low, middle, high) was reported to be the more precise, with 70% accuracy for all personality traits.
Since many studies have demonstrated a correlation between textual data and personality traits, some works have used the text in users' profiles, posts, and tweets as a tool for personality extraction, alongside other information available in social networks. This other information includes structural, personal, and behavioral information [11], users' interests and preferences (the matrix of users' likes) [12], cultural information, and information about where the person lives (such as ethnicity distribution, average house price, and average income) [13]. Most of these works used text analysis tools such as LIWC [14] to extract the desired features. This software measures the usage of predefined categories of words throughout the text. To our knowledge, the present study is the first to use text alone, with text mining methods as a powerful tool, for predicting users' personality traits in the field of online social networks. As in other text classification work, our approach has two major stages: first, the most predictive features are extracted from the text; then the documents (each user's set of statuses) are classified using a machine learning algorithm.

3. Methodology
One of the major applications of text mining is text classification. Text classification assigns a document to a predefined category of documents. Specifically, given a set of labeled documents D={d1,d2,...,dn} belonging to the set of categories C={c1, c2,..., cp}, the task of text classification is to train classifiers on these documents and to assign new (unseen) documents to the specified categories [15]. In this work, we used about 9900 status updates from 250 Facebook users, collected by the myPersonality project [12], to evaluate our methods.
We merged all the statuses posted by each user into a single text, so that for every user in the dataset we have one text containing all of that user's statuses. In this work, the text classification task is to assign each text (a user's statuses) to the low or high category for each personality trait. Our approach consists of the following main steps.

3.1 Preprocessing
The preprocessing phase prepares the statuses for classification: labels and stop words are removed, and the remaining text is then stemmed. Stop words are words that do not add meaningful
  • 3. content to the data set (e.g., pronouns, prepositions, conjunctions). Consequently, removing them significantly reduces the space of items in the training and testing texts and simplifies the targeted analysis. Stemming is the process of removing prefixes and suffixes, leaving the stem or root of the words considered.

3.2 Feature Extraction
Textual documents must be represented in a form that the classifier can interpret. The two main approaches to text representation are the Bag-of-Words model (BOW) and the Vector Space Model (VSM) [15]. In the BOW model, each word is represented as a separate variable with a numeric weight. VSM is now recognized as the best text representation model. Its basic idea is to represent a document as a vector whose components are the weights of the feature terms. A term's weight can be binary or real-valued. In the binary case, 0 indicates the absence of the term and 1 indicates its presence in the document. When the weights are non-binary, they are calculated with statistical and probabilistic techniques. One of the most popular term weighting functions is tf*idf [14]. This method weighs the frequency of a word in one document against its frequency across the whole document set. One of the main steps of feature extraction is n-gram conversion [16]. The n-gram conversion consists of extracting a bag-of-words representation of the text field. In this work we used unigrams, bigrams, and trigrams as features.

3.3 Feature Selection
The next step is selecting a suitable feature space from the terms in the documents. This is a vital stage, because the system's precision depends heavily on the selected terms that represent each document. We used a filter-based feature selection method with the information gain ranking criterion to select the features with the greatest predictive power.
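The preprocessing, n-gram extraction, binary weighting, and information gain ranking described in Sections 3.1 to 3.3 can be illustrated with a minimal Python sketch. It is not the Weka pipeline used in the study: the stop-word list, the sample statuses, and all function names are hypothetical, and stemming is omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"i", "a", "the", "to", "and", "of", "is", "my", "so"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def binary_features(text, n):
    """Binary weighting: a term is either present (in the set) or absent."""
    return set(ngrams(tokenize(text), n))

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, term):
    """Entropy of the labels minus their entropy given presence of the term."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

def rank_features(docs, labels):
    """Rank every term in the vocabulary by decreasing information gain."""
    vocab = set().union(*docs)
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Invented example statuses with a low/high label for one trait.
texts = [
    "Party tonight with friends!",
    "Great party with friends",
    "Reading alone at home",
    "Quiet night at home",
]
labels = ["high", "high", "low", "low"]
docs = [binary_features(t, 1) for t in texts]   # unigram (b1) features
ranking = rank_features(docs, labels)
```

In this toy data the term "party" separates the two classes perfectly and receives the maximum information gain of 1 bit, while "tonight" occurs in only one status and scores lower. Passing `n=2` or `n=3` to `binary_features` gives the bigram and trigram variants.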
3.4 Classification
In this study we used an RBF neural network to classify users' statuses. A radial basis function (RBF) network is a type of artificial neural network used for supervised learning problems [17]. Training is relatively fast because of the simple structure of RBF networks. In addition, RBF networks are capable of universal approximation under non-restrictive assumptions [19]. RBF networks can be used in any type of model, linear or non-linear, and in any kind of network, single-layer or multilayer [18]. In general, for neural network training the documents are divided into training and test documents: the training documents are used to train the system and the test documents to evaluate it. Because the data set is small, instead of dividing the data into a training part and a test part we used 10-fold cross-validation to measure the effectiveness of the neural network. In this method, the data set is divided into 10 subsets, and each time one subset is used for testing while the rest of the data serves as training data. The reported precision is the mean of the precision obtained over these 10 rounds.

4. Experiments And Results
In this work, for each personality trait, three data sets were built using unigrams, bigrams, and trigrams with binary weighting. The whole procedure was performed in the Weka toolkit [17] using the StringToWordVector filter. The Snowball stemmer was used for stemming and the NGramTokenizer for tokenization. To build the RBF neural networks we used the RBF implementation in the Weka toolkit. For each personality trait, three RBF neural networks were built from the three data sets.
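The 10-fold cross-validation procedure described in Section 3.4 can be sketched as follows. This is an illustrative outline, not Weka's implementation (which also shuffles and stratifies the folds): the function names are hypothetical, a majority-class predictor stands in for the RBF network, and accuracy stands in for the precision score so that the example stays self-contained.

```python
from collections import Counter

def k_fold_indices(n_samples, k=10):
    """Partition range(n_samples) into k folds of near-equal size."""
    return [list(range(i, n_samples, k)) for i in range(k)]

def cross_validate(samples, labels, train_fn, score_fn, k=10):
    """Train on k-1 folds, score on the held-out fold, and average."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [samples[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / len(scores)

# Placeholder learner (an assumption, NOT an RBF network): always
# predicts the most frequent training label.
def majority_train(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]

# Placeholder score: fraction of held-out labels predicted correctly.
def accuracy(model, test_x, test_y):
    return sum(1 for y in test_y if y == model) / len(test_y)

# 250 dummy users, mirroring the size of the corpus; labels are invented.
samples = list(range(250))
labels = ["high" if i < 150 else "low" for i in range(250)]
mean_score = cross_validate(samples, labels, majority_train, accuracy, k=10)
```

With 250 users and k = 10, each fold holds 25 users, so every round trains on 225 users and tests on the remaining 25. Any classifier exposing the same `train_fn` and `score_fn` interface could be substituted for the placeholder.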
To evaluate the method, we compared the precision and recall obtained with the RBF text classifier against those of other classifiers popular in text classification, namely SVM and Naïve Bayes. The two criteria are defined as follows:

\[ \mathrm{Precision} = \frac{a}{a + b} \qquad (1) \]

\[ \mathrm{Recall} = \frac{a}{a + c} \qquad (2) \]

a = the number of texts correctly assigned to a category
b = the number of texts incorrectly assigned to a category
c = the number of texts incorrectly rejected from a category

According to the results in the tables below, for each of the five personality traits the RBF classifier's precision is higher with the bigram model (b2) than with the unigram (b1) and trigram (b3) models. In all the experiments, the RBF neural network also has higher precision than the Naïve Bayes and SVM classifiers. The best precision obtained was 0.945 for extraversion, 0.931 for neuroticism, 0.894 for agreeableness, 0.949 for conscientiousness, and 0.931 for openness; agreeableness was the hardest trait for the RBF-based classifier and conscientiousness the easiest. The results show that trigrams are more effective than unigrams for openness and extraversion, whereas unigrams are more effective than trigrams for conscientiousness, agreeableness, and neuroticism. Comparing the precision and recall of the SVM and Naïve Bayes classifiers, Naïve Bayes is more effective than SVM for classifying openness (NaiveBayes-b3), agreeableness (NaiveBayes-b2), neuroticism (NaiveBayes-b1), and extraversion (NaiveBayes-b2), whereas for conscientiousness SVM (SVM-b2) is more effective than Naïve Bayes.
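The precision and recall criteria above, with a, b, and c counted for one category, can be computed as in the following minimal Python sketch. The function name and the example labels are hypothetical and are not taken from the study's data.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Compute precision and recall for one category.

    a: texts correctly assigned to the category
    b: texts incorrectly assigned to the category
    c: texts incorrectly rejected from the category
    """
    pairs = list(zip(true_labels, predicted_labels))
    a = sum(1 for t, p in pairs if p == category and t == category)
    b = sum(1 for t, p in pairs if p == category and t != category)
    c = sum(1 for t, p in pairs if p != category and t == category)
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return precision, recall

# Invented example: five users labeled for one trait.
true_labels = ["high", "high", "high", "low", "low"]
predicted = ["high", "high", "low", "high", "low"]
p, r = precision_recall(true_labels, predicted, "high")
```

Here a = 2, b = 1, and c = 1 for the "high" category, so precision and recall are both 2/3.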
  • 4. TABLE I. EXTRAVERSION CLASSIFICATION RESULTS

Classifier-Feature   Precision   Recall
RBF-b1               0.893       0.888
RBF-b2               0.945       0.94
RBF-b3               0.905       0.892
SVM-b1               0.804       0.8
SVM-b2               0.846       0.82
SVM-b3               0.826       0.772
NaiveBayes-b1        0.813       0.808
NaiveBayes-b2        0.854       0.844
NaiveBayes-b3        0.819       0.8

TABLE II. NEUROTICISM CLASSIFICATION RESULTS

Classifier-Feature   Precision   Recall
RBF-b1               0.903       0.872
RBF-b2               0.931       0.916
RBF-b3               0.854       0.768
SVM-b1               0.764       0.748
SVM-b2               0.856       0.82
SVM-b3               0.814       0.732
NaiveBayes-b1        0.878       0.856
NaiveBayes-b2        0.863       0.832
NaiveBayes-b3        0.827       0.772

TABLE III. AGREEABLENESS CLASSIFICATION RESULTS

Classifier-Feature   Precision   Recall
RBF-b1               0.88        0.852
RBF-b2               0.894       0.868
RBF-b3               0.859       0.808
SVM-b1               0.758       0.748
SVM-b2               0.847       0.796
SVM-b3               0.808       0.7
NaiveBayes-b1        0.813       0.784
NaiveBayes-b2        0.882       0.864
NaiveBayes-b3        0.728       0.728

TABLE IV. CONSCIENTIOUSNESS CLASSIFICATION RESULTS

Classifier-Feature   Precision   Recall
RBF-b1               0.932       0.924
RBF-b2               0.949       0.944
RBF-b3               0.887       0.856
SVM-b1               0.802       0.78
SVM-b2               0.834       0.796
SVM-b3               0.816       0.748
NaiveBayes-b1        0.777       0.764
NaiveBayes-b2        0.805       0.784
NaiveBayes-b3        0.821       0.776

TABLE V. OPENNESS CLASSIFICATION RESULTS

Classifier-Feature   Precision   Recall
RBF-b1               0.912       0.9
RBF-b2               0.931       0.924
RBF-b3               0.925       0.916
SVM-b1               0.834       0.816
SVM-b2               0.832       0.796
SVM-b3               0.837       0.788
NaiveBayes-b1        0.85        0.772
NaiveBayes-b2        0.861       0.836
NaiveBayes-b3        0.875       0.848

5. Conclusion
In this study we explored the text available in social networks, together with text mining methods, as a tool for classifying Facebook users' personality traits. Our main purpose was to investigate the relationship between particular words and social network users' personality traits. For each personality trait we built three data sets according to the terms representing the text (unigram, bigram, trigram), with binary term weighting. Using these data sets, we trained several RBF neural networks to classify social network users' personality, and we evaluated the effectiveness of the networks with 10-fold cross-validation.
The results show that the bigram model performs better on Facebook user statuses than the trigram and unigram models. The results also show that RBF neural networks achieve higher precision than the other classifiers for personality trait classification. The high precision obtained for all five personality traits shows that a sample of a user's Facebook statuses can reveal information about the person's personality traits and, consequently, help predict his or her behavior in specific situations. The main advantage of this approach is that it does not require other user data, such as the number of likes, the number of joined groups, or structural data about the user's friend network. The text in user statuses, combined with text classification as a text mining application, thus becomes a powerful tool for classifying users' personality traits. The results of this research can be useful in fields such as assistive technology, e-learning, e-business, health care systems, and recommender systems.

References
[1] A. Kobsa, “Generic user modeling systems,” User Modeling and User-Adapted Interaction, Vol. 11, No. 1-2, pp. 49-63, 2001.
[2] A. Ortigosa, R. M. Carro and J. I. Quiroga, “Predicting user personality by mining social interactions in Facebook,” Journal of Computer and System Sciences, Vol. 80, pp. 57-71, 2014.
[3] L. R. Goldberg and T. K. Rosolack, The Developing Structure of Temperament and Personality from Infancy to Adulthood, Chapter “The Big Five Factor Structure as an Integrative Framework: An Empirical Comparison with Eysenck’s P-E-N Model,” Erlbaum, New York, 1994.
[4] A. Aluja, J. Rossier, L. F. García, A. Angleitner, M. Kuhlman and M. Zuckerman, “A cross-cultural shortened form of the ZKPQ (ZKPQ-50-cc) adapted to English, French, German, and Spanish languages,” Personality and Individual Differences, Vol. 41, No. 4, pp. 619-628, 2006.
[5] M. D. Back, J. M. Stopfer, S. Vazire, S. Gaddis, S. C. Schmukle, B. Egloff and S. D. Gosling,
“Facebook profiles reflect actual personality, not self-idealization,” Psychological Science, 2010.
[6] L. Sorensen, “User managed trust in social networking - Comparing Facebook, MySpace and Linkedin,” In Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronic Systems Technology, 2009. Wireless VITAE 2009. 1st International Conference on, IEEE, pp. 427-431, 2009.
[7] F. Liu and L. Xiong, “Survey on text clustering algorithm,” In Software Engineering and Service Science (ICSESS), 2011 IEEE 2nd International Conference on, IEEE, pp. 901-904, 2011.
[8] J. Oberlander and S. Nowson, “Whose thumb is it anyway?: classifying author personality from weblog text,” In Proceedings of the COLING/ACL on Main conference poster sessions, Association for Computational Linguistics, pp. 627-634, 2006.
[9] F. Mairesse, M. A. Walker, M. R. Mehl and R. K. Moore, “Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text,” J. Artif. Intell. Res. (JAIR), 30, pp. 457-500, 2007.
  • 5. [10] F. Iacobelli, A. J. Gill, S. Nowson and J. Oberlander, “Large scale personality classification of bloggers,” In Affective Computing and Intelligent Interaction, Springer Berlin Heidelberg, pp. 568-577, 2011.
[11] J. Golbeck, C. Robles and K. Turner, “Predicting Personality with Social Media,” In CHI'11 Extended Abstracts on Human Factors in Computing Systems, ACM, pp. 253-262, 2011.
[12] M. Kosinski, D. Stillwell and T. Graepel, “Private traits and attributes are predictable from digital records of human behavior,” In Proceedings of the National Academy of Sciences, Vol. 110, No. 15, pp. 5802-5805, 2013.
[13] D. Chapsky, “Leveraging Online Social Networks and External Data Sources to Predict Personality,” In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, IEEE, pp. 428-433, 2011.
[14] J. W. Pennebaker, M. E. Francis and R. J. Booth, Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71, 2001.
[15] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys (CSUR), Vol. 34, No. 1, pp. 1-47, 2002.
[16] B. Carpenter, “Scaling High-Order Character Language Models to Gigabytes,” In Proceedings of the 2005 Association for Computational Linguistics Software Workshop, pp. 1-14, 2005.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[18] M. J. Orr, Introduction to Radial Basis Function Networks, 1996.
[19] J. Park and I. W. Sandberg, “Approximation and Radial-Basis-Function Networks,” Neural Computation, Vol. 5, No. 2, pp. 305-316, 1993.