Text Mining Online Social Networks for Personality Classification
Farzad Golnoori1, Mohammad Karim Sohraby2, and Farzin Yaghmaei3
1 Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Semnan, Iran, farzadgolnoori@yahoo.com
2 Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Semnan, Iran, Amir_sohraby@yahoo.com
3 Department of Electrical and Computer Engineering, Semnan University, Semnan, Iran, f_yaghmaee@semnan.ac.ir
Abstract: Online social networks are among the most widely used applications on the Internet. Users of
these networks express their tastes, interests, and feelings on them every day. The texts that users share
can therefore be an important and rich source for investigating their behaviour and personality traits. In
this study we examine text, one of the many kinds of data available in online social networks, in
order to classify users' personality traits. For this purpose we built a corpus of about 9900 status updates from 250
Facebook users. From this corpus we built different datasets according to the features extracted
from the text, and we then used an RBF neural network to classify users' personality traits. The results show that, for
personality trait classification, the RBF neural network achieves higher precision than common classifiers such as SVM
and Naïve Bayes.
Keywords: Online social network, Text mining, Personality, User modeling.
1. Introduction
With the growing use of the Internet and web applications,
obtaining information about users' behavior on the web has
become important, both in applications where user modeling
plays a critical role, such as recommender systems and
personalized systems, and in applications such as targeted
marketing. Depending on the application, a user model can
be based on personal information such as the user's name and
age, skills, knowledge, plans and goals, preferences and
dislikes, or information about the user's behavior and
personality [1]. Among these, the user's personality is one of
the most interesting features.
The personality of an individual can be defined as a set of
features that induces a tendency in the individual's behavior.
This tendency is stable across time and situations.
Knowing the personality of a given person provides hints
about how he or she would probably react when facing different
situations [2]. Research in the psychology literature has led to a
well-established model for personality recognition and
description, called the Big Five Personality Model [3]. The five
traits can be summarized in the following way:
• Extraversion measures a tendency to seek stimulation in
the external world, the company of others, and to express
positive emotions. Extroverts tend to be more outgoing,
friendly, and socially active. They are usually energetic and
talkative; they do not mind being at the center of attention,
and make new friends more easily. Introverts are more
likely to be solitary or reserved and seek environments
characterized by lower levels of external stimulation.
• Conscientiousness measures preference for an organized
approach to life in contrast to a spontaneous one.
Conscientious people are more likely to be well organized,
reliable, and consistent. They enjoy planning, seek
achievements, and pursue long-term goals. Non-
conscientious individuals are generally more easy-going,
spontaneous, and creative. They tend to be more tolerant
and less bound by rules and plans.
• Openness to experience (Openness) is related to
imagination, creativity, curiosity, tolerance, political
liberalism, and appreciation for culture. People scoring high
on Openness like change, appreciate new and unusual
ideas, and have a good sense of aesthetics.
• Agreeableness relates to a focus on maintaining positive
social relations, being friendly, compassionate, and
cooperative. Agreeable people tend to trust others and adapt
to their needs. Disagreeable people are more focused on
themselves, less likely to compromise, and may be less
gullible. They also tend to be less bound by social
expectations and conventions, and more assertive.
• Neuroticism (reversely referred to as Emotional Stability)
measures the tendency to experience mood swings and
emotions such as guilt, anger, anxiety, and depression.
Emotionally unstable (neurotic) people are more likely to
experience stress and nervousness, while emotionally stable
people (low Neuroticism) tend to be calmer and self-
confident.
The most commonly used procedure to obtain this
information consists of asking the user to fill in
questionnaires. However, users can find this task too time-
consuming, since most of the personality questionnaires
include many questions to answer in order to obtain an
accurate user profile [2],[3].
Social networks such as Facebook are rich sources of
text in different forms. Facebook users can update their
status, post comments on their friends' walls, or comment
on other users' posts. One of the most popular of these
features is the status update, which works as a small blog
for describing a person's views, feelings, beliefs, and
behavior. User statuses therefore potentially contain
information about the personality of Facebook users [4].
However, on social networking websites people generally
use unstructured or semi-structured language for
communication. As in everyday conversation, people do not
care about spelling and accurate grammatical construction,
which may lead to different types of ambiguity, such as
lexical, syntactic, and semantic ambiguity [6]. Therefore,
extracting logical patterns with accurate information from
such unstructured text is a critical task. Text mining can be
a solution to the above-mentioned problems.
Text mining refers to the analysis of textual data using
machine learning techniques, intelligent information
retrieval, natural language processing, or other related
methods to extract and discover knowledge from text [6].
Moreover, because Facebook and other social networks have
in recent years introduced many rules to protect users'
privacy, text can be more accessible than the other kinds of
data used in online social networks. The main purpose of this
study is to use the text available in social networks and to
investigate the power of the features that can be extracted
from it to classify personality traits, without using other
kinds of information about the user, such as information on
the user's use of the social network (the number of statuses,
the number of joined groups, the number of Likes) or
structural information about the user's egocentric network,
such as the number of friends or measures such as
betweenness and density. The main question is whether a
sample of a user's Facebook statuses or Twitter tweets is
enough to obtain useful information about the user's
personality.
2. Related Works
In recent years there have been many different attempts to
automatically classify personality traits from text or from
other cues, such as social network usage. Oberlander and
Nowson [8] classified the extraversion, stability,
agreeableness, and conscientiousness of blog authors using
n-grams as features and Naive Bayes (NB) as the learning
algorithm. They reported that binary classes and automatic
feature selection yield the best improvement over the
baseline. Mairesse et al. [9] ran personality recognition on
both conversation (using observer judgements) and text (using
self-assessments via Big5). They exploited two lexical
resources, LIWC and MRC, as features and predicted both
personality scores and classes using Support Vector Machines
(SVMs) and M5 trees respectively. They also reported a long
list of correlations between the Big5 personality traits and the
two lexical resources they used. Iacobelli et al. [10] used as
features word n-grams extracted from a large corpus of blogs,
testing different extraction settings, such as the presence or
absence of stop words or inverse document frequency. They
found that bigrams, treated as boolean features and keeping
stop words, yield very good results using SVMs as the
learning algorithm, although the features extracted are few in
a very large corpus.
As for personality recognition from social networking
sites, [2] used parameters related to Facebook users' activity,
such as the number of friends, the number of posts in the last
month, and the number of months since the user began his or
her activity on Facebook, together with a decision tree
algorithm, to build personality trait classifiers in two settings,
3-class and 5-class. The 3-class setting (low, middle, high)
was reported to have the higher precision, with 70% accuracy
for all personality traits. Since many studies have shown a
correlation between textual data and personality traits, some
works have treated the text in users' profiles, posts, and tweets
as a tool for personality extraction alongside other kinds of
social network data, such as structural, personal, and
behavioral information [11], users' interests and preferences
(the matrix of users' Likes) [12], cultural information, and
information about where the person lives (such as ethnicity
distribution, average house price, and average income) [13].
Most of these works used text analysis tools such as
LIWC [14] to extract the desired features. This software
measures the usage of predefined categories of words
throughout the text. To the best of our knowledge, the present
study is the first to use text alone, with text mining as a
powerful tool, to predict users' personality traits in online
social networks. As in other text classification tasks, our
approach has two major stages: first, the most predictive
features are extracted from the text, and then the documents
(sets of user statuses) are classified using a machine learning
algorithm.
3. Methodology
One of the major applications of text mining is text
classification. Text classification assigns a document to a
predefined category of documents.
Specifically, given a set of labeled documents
D={d1,d2,...,dn} belonging to the set of categories
C={c1, c2,..., cp}, the task of text classification is to train
classifiers on these documents and to assign new (unseen)
documents to the appropriate categories [15]. In this work, we
used about 9900 status updates from 250 Facebook users,
collected by the myPersonality project [12], to evaluate our
methods. We merged all the statuses posted by each user into
a single text, so that for every user in the dataset we have one
text containing all of that user's statuses. Here, the text
classification task is to assign each user's text (the set of the
user's statuses) to the low or high category of each personality
trait. Our approach consists of the following main steps.
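The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the sample rows, and the trait scores are hypothetical, and the threshold used to split a trait into low and high is an assumption, since the text does not state how the two categories were derived.

```python
from collections import defaultdict

def merge_statuses(rows):
    """Concatenate all statuses of each user into one document per user."""
    docs = defaultdict(list)
    for user_id, status in rows:
        docs[user_id].append(status)
    return {user: " ".join(texts) for user, texts in docs.items()}

def binarize(scores, threshold):
    """Map numeric trait scores to 'high' or 'low' (threshold is an assumption)."""
    return {user: ("high" if score >= threshold else "low")
            for user, score in scores.items()}

# Hypothetical sample data: (user id, status text) pairs and trait scores.
rows = [
    ("u1", "Great day at the beach"),
    ("u2", "Stuck at home again"),
    ("u1", "Party tonight!"),
]
documents = merge_statuses(rows)
labels = binarize({"u1": 4.2, "u2": 2.1}, threshold=3.0)
```

Each user ends up with exactly one document and one label per trait, which is the unit the later steps operate on.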
3.1 Preprocessing
The preprocessing phase prepares the statuses for the
classification procedure: labels and stop words are removed,
and stemming is then applied to the remaining text of each
document.
Stop words are words that do not add meaningful
content to the dataset (e.g., pronouns, prepositions,
conjunctions). Consequently, removing them significantly
reduces the space of items in the training and testing
texts and simplifies the targeted analysis. Stemming is the
process of removing prefixes and suffixes, leaving the stem or
root of the words considered.
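A minimal sketch of stop-word removal followed by stemming is given here. The stop-word list is a small hypothetical sample, and the suffix stripper is a deliberately crude stand-in for a real stemmer such as the Snowball stemmer mentioned in Section 4, so its output differs from what that stemmer would produce.

```python
import re

# Hypothetical, very small stop-word list for illustration only.
STOP_WORDS = {"i", "a", "an", "the", "and", "or", "to", "of",
              "in", "on", "at", "is", "am", "are", "my", "it"}
# Suffixes tried in order; this is NOT the Snowball algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("I am planning the parties and walked to the beaches")
print(tokens)  # ['plann', 'parti', 'walk', 'beach']
```

The stems need not be dictionary words; it only matters that different inflections of a word map to the same feature.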
3.2 Feature Extraction
Textual documents must be represented in a form that the
classifier can interpret. The two main approaches to text
representation are the Bag-of-Words model (BOW) and the
Vector Space Model (VSM) [15]. In the BOW model, each word is
represented as a separate variable with a numeric weight.
VSM is now widely regarded as the best text representation model.
Its basic idea is to represent the document as a vector
in which each feature term is a weighted component. A term's weight
can be binary or real-valued. In the binary case, 0 indicates the
absence of the term and 1 indicates its presence in the
document.
When the weights are non-binary, they are calculated with
statistical and probabilistic techniques. One of the most popular
term weighting functions is tf*idf [14]. This method
weighs the frequency of a word in one document against its
frequency in the whole document set. One of the main steps of
feature extraction is n-gram conversion [16]. The n-gram
conversion consists of extracting a bag-of-words representation
of the text field. In this work we used unigrams, bigrams, and
trigrams as features.
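The representation described above can be illustrated with a short sketch that extracts word n-grams, builds binary presence vectors, and computes one common tf*idf variant. The function names and the two toy documents are hypothetical, and the exact tf*idf formula (raw term count times the natural log of N/df) is an assumption, since the text does not specify which variant is meant.

```python
import math

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def binary_vectors(docs, n):
    """Build a sorted vocabulary and one 0/1 presence vector per document."""
    grams = [set(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*grams))
    vectors = [[1 if term in g else 0 for term in vocab] for g in grams]
    return vocab, vectors

def tf_idf(term, doc_grams, all_doc_grams):
    """Raw term frequency times log(N / document frequency)."""
    tf = doc_grams.count(term)
    df = sum(1 for g in all_doc_grams if term in g)
    return tf * math.log(len(all_doc_grams) / df) if df else 0.0

# Two toy documents, already tokenized.
docs = [["happy", "day", "today"], ["sad", "day", "today"]]
vocab, vectors = binary_vectors(docs, n=2)
bigram_docs = [ngrams(d, 2) for d in docs]
weight = tf_idf("happy day", bigram_docs[0], bigram_docs)
```

Here `vocab` is `['day today', 'happy day', 'sad day']` and the vectors are `[1, 1, 0]` and `[1, 0, 1]`. The bigram shared by both documents gets a tf*idf weight of zero, which shows why the weighting favours terms that distinguish documents.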
3.3 Feature Selection
The next step is to select a suitable feature space from the
terms in the documents. This stage is vital, because the
precision of the system depends heavily on the selected terms
that represent each document. We used a filter-based feature
selection method with the Information Gain ranking criterion
to select the features with the greatest predictive capability.
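As an illustration of ranking features by information gain, the sketch below scores each binary feature by how much it reduces the entropy of the class labels and keeps the top k. The function names and toy data are hypothetical, and the number of features retained in the study is not stated, so `k` is a free parameter here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_column, labels):
    """Entropy of the labels minus the entropy left after splitting on the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_column):
        subset = [l for f, l in zip(feature_column, labels) if f == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def top_k_features(vectors, labels, k):
    """Return the indices of the k features with the highest information gain."""
    n_features = len(vectors[0])
    scores = [information_gain([row[j] for row in vectors], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 separates the classes perfectly, feature 1 not at all.
vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["high", "high", "low", "low"]
selected = top_k_features(vectors, labels, k=1)
```

Because this is a filter method, the scores depend only on the data and the labels, not on the classifier that is trained afterwards.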
3.4 Classification
In this study we used an RBF neural network to classify
users' statuses. A Radial Basis Function (RBF) network is a
type of artificial neural network for supervised learning
problems [17]. Training an RBF network is relatively fast
because of its simple structure. In addition, RBF networks are
capable of universal approximation under non-restrictive
assumptions [19]. RBF networks can be applied to any type of
model, linear or non-linear, and any kind of network,
single-layer or multilayer [18]. Generally, for neural
network training, the documents are divided into training and
test documents: the training documents are used to train the
system and the test documents to evaluate it.
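To make the structure of an RBF network concrete, the following sketch uses Gaussian hidden units followed by a single logistic output trained by batch gradient descent. It is an illustrative toy and not the Weka implementation used in the experiments: the class name, the choice of passing the centers in by hand, the width `sigma`, and the training settings are all assumptions.

```python
import math

def rbf_features(x, centers, sigma):
    """Gaussian activation of each hidden unit for the input vector x."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * sigma ** 2))
            for c in centers]

class TinyRBFNetwork:
    """Hidden layer of Gaussian units plus one logistic output unit."""

    def __init__(self, centers, sigma=0.5):
        self.centers = centers            # assumed to be supplied by the caller
        self.sigma = sigma
        self.weights = [0.0] * len(centers)
        self.bias = 0.0

    def _forward(self, x):
        phi = rbf_features(x, self.centers, self.sigma)
        z = self.bias + sum(w * p for w, p in zip(self.weights, phi))
        return 1.0 / (1.0 + math.exp(-z)), phi

    def fit(self, X, y, epochs=200, lr=0.5):
        """Batch gradient descent on the logistic loss of the output layer."""
        n = len(X)
        for _ in range(epochs):
            grad_w = [0.0] * len(self.weights)
            grad_b = 0.0
            for x, target in zip(X, y):
                prob, phi = self._forward(x)
                error = prob - target
                grad_b += error
                for j, p in enumerate(phi):
                    grad_w[j] += error * p
            self.weights = [w - lr * g / n for w, g in zip(self.weights, grad_w)]
            self.bias -= lr * grad_b / n
        return self

    def predict(self, x):
        return 1 if self._forward(x)[0] >= 0.5 else 0

# XOR on binary features: not linearly separable in the input space,
# but separable after the Gaussian hidden layer.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
net = TinyRBFNetwork(centers=X, sigma=0.5).fit(X, y)
predictions = [net.predict(x) for x in X]
```

Only the output layer is trained here, which reflects the simple structure that makes RBF training fast. In practice the centers are usually chosen by a clustering step rather than set to the training points.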
Because the dataset is small, instead of dividing the data
into separate training and test parts we used 10-fold
cross-validation to measure the effectiveness of the neural
network. In this method, the dataset is divided into 10
subsets; each time, one subset is used for testing while the
rest of the data serves as training data. The reported
precision is the mean of the precision obtained in these 10
rounds.
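The cross-validation loop can be sketched as below. The helper names are hypothetical, the folds are assigned round-robin without shuffling or stratification (an assumption made to keep the example deterministic), and the majority-class scorer is only a placeholder for whatever classifier and metric are being evaluated.

```python
def k_fold_indices(n_items, k=10):
    """Assign item indices to k folds of near-equal size, round-robin."""
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    return folds

def cross_validate(X, y, train_and_score, k=10):
    """Hold out each fold once, train on the rest, and average the scores."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        test_X = [X[i] for i in test_idx]
        test_y = [y[i] for i in test_idx]
        scores.append(train_and_score(train_X, train_y, test_X, test_y))
    return sum(scores) / len(scores)

def majority_baseline(train_X, train_y, test_X, test_y):
    """Placeholder scorer: predict the most frequent training label."""
    majority = max(sorted(set(train_y)), key=train_y.count)
    return sum(1 for label in test_y if label == majority) / len(test_y)

# With 250 user documents and k=10, every fold holds 25 documents.
fold_sizes = [len(fold) for fold in k_fold_indices(250, 10)]
mean_score = cross_validate(list(range(10)), ["high"] * 7 + ["low"] * 3,
                            majority_baseline, k=5)
```

Every document is used for testing exactly once, so the averaged score uses all of the data without ever testing on a document the model was trained on.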
4. Experiments and Results
In this work, for each personality trait, three datasets were
built using unigrams, bigrams, and trigrams with binary
weighting. The whole procedure was performed in the Weka
toolkit [17] using the StringToWordVector filter. Stemming
was done with Snowball and tokenization with the NGram
Tokenizer. To build the RBF neural networks we used the RBF
implementation in the Weka toolkit. For each personality trait,
three RBF neural networks were built, one for each of the
three datasets.
To evaluate the method, we compared the precision and
recall obtained with the RBF text classifier against those of
other popular classifiers in text classification, such as SVM
and Naïve Bayes. The two criteria are defined as follows:
\[ \mathrm{Precision} = \frac{a}{a + b} \tag{1} \]
\[ \mathrm{Recall} = \frac{a}{a + c} \tag{2} \]
a = the number of texts correctly assigned to a category
b = the number of texts incorrectly assigned to a category
c = the number of texts incorrectly rejected from a category
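The two measures can be computed directly from the counts a, b, and c defined above. The sketch below is a small illustration with hypothetical label lists; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
def precision_recall(predicted, actual, category):
    """Compute precision a/(a+b) and recall a/(a+c) for one category."""
    pairs = list(zip(predicted, actual))
    a = sum(1 for p, t in pairs if p == category and t == category)  # correct
    b = sum(1 for p, t in pairs if p == category and t != category)  # wrongly assigned
    c = sum(1 for p, t in pairs if p != category and t == category)  # wrongly rejected
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall

# Hypothetical predictions for five users on one trait.
predicted = ["high", "high", "low", "high", "low"]
actual = ["high", "low", "high", "high", "low"]
precision, recall = precision_recall(predicted, actual, "high")
```

For the "high" category this gives a = 2, b = 1, and c = 1, so precision and recall are both 2/3.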
According to the results in the tables below, for each of
the five personality traits the bigram model (b2) gives higher
precision than the unigram (b1) and trigram (b3) models. In all
experiments, the RBF neural network also achieves higher
precision than the Naïve Bayes and SVM classifiers. The best
precision obtained is .945 for extraversion, .931 for
neuroticism, .894 for agreeableness, .949 for
conscientiousness, and .931 for openness; the hardest trait for
the RBF-based classifier was agreeableness and the easiest
was conscientiousness. The results also show that trigrams are
more effective than unigrams for openness and extraversion,
whereas unigrams are more effective than trigrams for
conscientiousness, agreeableness, and neuroticism. Comparing
the precision and recall of the SVM and Naïve Bayes
classifiers, Naïve Bayes is more effective than SVM for
openness (Naive Bayes-b3), agreeableness (Naive Bayes-b2),
neuroticism (Naive Bayes-b1), and extraversion (Naive
Bayes-b2), while for conscientiousness SVM (SVM-b2) is
more effective than Naïve Bayes.

TABLE I. EXTRAVERSION CLASSIFICATION RESULTS
Classifier-Feature Precision Recall
RBF-b1 0.893 0.888
RBF-b2 0.945 0.94
RBF-b3 0.905 0.892
SVM-b1 0.804 0.8
SVM-b2 0.846 0.82
SVM-b3 0.826 0.772
NaiveBayes-b1 0.813 0.808
TABLE II. NEUROTICISM CLASSIFICATION RESULTS
Classifier-Feature Precision Recall
RBF-b1 0.903 0.872
RBF-b2 0.931 0.916
RBF-b3 0.854 0.768
SVM-b1 0.764 0.748
SVM-b2 0.856 0.82
SVM-b3 0.814 0.732
TABLE III. AGREEABLENESS CLASSIFICATION RESULTS
Classifier-Feature Precision Recall
RBF-b1 0.88 0.852
RBF-b2 0.894 0.868
RBF-b3 0.859 0.808
SVM-b1 0.758 0.748
SVM-b2 0.847 0.796
SVM-b3 0.808 0.7
TABLE IV. CONSCIENTIOUSNESS CLASSIFICATION RESULTS
Classifier-Feature Precision Recall
RBF-b1 0.932 0.924
RBF-b2 0.949 0.944
RBF-b3 0.887 0.856
SVM-b1 0.802 0.78
SVM-b2 0.834 0.796
SVM-b3 0.816 0.748
TABLE V. OPENNESS CLASSIFICATION RESULTS
Classifier-Feature Precision Recall
RBF-b1 0.912 0.9
RBF-b2 0.931 0.924
RBF-b3 0.925 0.916
SVM-b1 0.834 0.816
SVM-b2 0.832 0.796
SVM-b3 0.837 0.788
5. Conclusion
In this study we explored the text available in social
networks, together with text mining methods, as a tool for
classifying the personality traits of Facebook users. Our main
purpose was to investigate the relationship between particular
words and the personality traits of social network users. For
each personality trait we built three datasets according to the
terms used to represent the text (unigrams, bigrams, trigrams)
with binary term weighting.
Using these datasets, we trained several RBF neural
networks to classify the personality of social network users
and evaluated their effectiveness with 10-fold
cross-validation. The results show that the bigram model
gives better results on Facebook user statuses than the trigram
and unigram models. The results also show that RBF neural
networks achieve higher precision than the other classifiers
for personality trait classification. The high precision
obtained for the five personality traits indicates that a sample
of a user's Facebook statuses can reveal information about the
person's personality traits and, consequently, help predict his
or her behavior in specific situations. The main advantage of
this approach is that it does not require other user data, such
as the number of Likes, the number of joined groups, or
structural data about the user's friend network. Thus the text
in user statuses, combined with text classification as a text
mining application, becomes a powerful tool for classifying
users' personality traits. The results of this research can be
useful in fields such as assistive technology, e-learning,
e-business, health care systems, and recommender systems.
References
[1] A. Kobsa, "Generic user modeling systems," User Modeling and
User-Adapted Interaction, Vol. 11, No. 1-2, pp. 49-63, 2001.
[2] A. Ortigosa, R. M. Carro and J. I. Quiroga, "Predicting user personality
by mining social interactions in Facebook," Journal of Computer and
System Sciences, Vol. 80, pp. 57-71, 2014.
[3] L. R. Goldberg and T. K. Rosolack, "The Big Five Factor Structure as
an Integrative Framework: An Empirical Comparison with Eysenck's
P-E-N Model," in The Developing Structure of Temperament and
Personality from Infancy to Adulthood, Erlbaum, New York, 1994.
[4] A. Aluja, J. Rossier, L. F. García, A. Angleitner, M. Kuhlman and M.
Zuckerman, "A cross-cultural shortened form of the ZKPQ
(ZKPQ-50-cc) adapted to English, French, German, and Spanish
languages," Personality and Individual Differences, Vol. 41, No. 4,
pp. 619-628, 2006.
[5] M. D. Back, J. M. Stopfer, S. Vazire, S. Gaddis, S. C. Schmukle, B.
Egloff and S. D. Gosling, "Facebook profiles reflect actual personality,
not self-idealization," Psychological Science, 2010.
[6] L. Sorensen, "User managed trust in social networking-Comparing
Facebook, MySpace and Linkedin," in 1st International Conference on
Wireless Communication, Vehicular Technology, Information Theory
and Aerospace & Electronic Systems Technology (Wireless VITAE
2009), IEEE, pp. 427-431, 2009.
[7] F. Liu and L. Xiong, "Survey on text clustering algorithm," in 2011
IEEE 2nd International Conference on Software Engineering and
Service Science (ICSESS), IEEE, pp. 901-904, 2011.
[8] J. Oberlander and S. Nowson, "Whose thumb is it anyway?:
classifying author personality from weblog text," in Proceedings of the
COLING/ACL Main Conference Poster Sessions, Association for
Computational Linguistics, pp. 627-634, 2006.
[9] F. Mairesse, M. A. Walker, M. R. Mehl and R. K. Moore, "Using
Linguistic Cues for the Automatic Recognition of Personality in
Conversation and Text," J. Artif. Intell. Res. (JAIR), Vol. 30,
pp. 457-500, 2007.
[10] F. Iacobelli, A. J. Gill, S. Nowson and J. Oberlander, "Large scale
personality classification of bloggers," in Affective Computing and
Intelligent Interaction, Springer Berlin Heidelberg, pp. 568-577, 2011.
[11] J. Golbeck, C. Robles and K. Turner, "Predicting Personality with
Social Media," in CHI'11 Extended Abstracts on Human Factors in
Computing Systems, ACM, pp. 253-262, 2011.
[12] M. Kosinski, D. Stillwell and T. Graepel, "Private traits and attributes
are predictable from digital records of human behavior," Proceedings of
the National Academy of Sciences, Vol. 110, No. 15, pp. 5802-5805,
2013.
[13] D. Chapsky, "Leveraging Online Social Networks and External Data
Sources to Predict Personality," in 2011 International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), IEEE,
pp. 428-433, 2011.
[14] J. W. Pennebaker, M. E. Francis and R. J. Booth, Linguistic Inquiry
and Word Count: LIWC 2001, Mahwah: Lawrence Erlbaum Associates,
71, 2001.
[15] F. Sebastiani, "Machine learning in automated text categorization,"
ACM Computing Surveys (CSUR), Vol. 34, No. 1, pp. 1-47, 2002.
[16] B. Carpenter, "Scaling High-Order Character Language Models to
Gigabytes," in Proceedings of the 2005 Association for Computational
Linguistics Software Workshop, pp. 1-14, 2005.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, Morgan Kaufmann, 2005.
[18] M. J. Orr, Introduction to Radial Basis Function Networks, 1996.
[19] J. Park and I. W. Sandberg, "Approximation and Radial-Basis-Function
Networks," Neural Computation, Vol. 5, No. 2, pp. 305-316, 1993.
