User behavior model & recommendation on basis of social networks

American International University - Bangladesh
Faculty of Science and Information Technology
Department of Computer Science
User Behavior Modeling & Recommendation
System Based On Social Networks
A thesis submitted for the degree of
Bachelor of Science in Computer Science and Engineering
By:
Alam Shah
10-17685-3
Hossain, MD. Shakawat
11-18494-1
Taher, Najeeb Ahmad
11-18198-1
Supervisor:
Md. Saddam Hossain
Assistant Professor, Department of Computer Science, American
International University-Bangladesh
Summer 2014

Declaration
This is to certify that this project is our original work. No part of this has been
submitted elsewhere partially or fully for the award of any other degree. Any
material reproduced in this project has been properly acknowledged.
Alam Shah
ID: 10-17685-3
Department: CSE

Hossain, MD. Shakawat
ID: 11-18494-1
Department: CSE

Taher, Najeeb Ahmad
ID: 11-18198-1
Department: CSE

Approval
The thesis titled “User Behavior Modeling & Recommendation System Based
On Social Networks” has been submitted to the following respected members of
the Board of Examiners of the Faculty of Science and Information Technology
in partial fulfillment of the requirements for the degree of Bachelor of Science in
Computer Science and Engineering and has been accepted as satisfactory.
Md. Saddam Hossain
Assistant Professor
Faculty of Computer Science
American International University-Bangladesh
Dr. Dip Nandi
Assistant Professor & Head
Professor Dr. Tafazzal Hossain
Dean
Dr. Carmen Z. Lamagna
Vice Chancellor

Acknowledgements
Special thanks to our honorable teacher and supervisor Md. Saddam
Hossain, Assistant Professor, Department of Computer Science,
American International University-Bangladesh. We are very grateful
to him for giving us the opportunity to work with him. Without his
continuous support, it would have been very difficult for us to complete
this work. We would also like to thank all the faculty members for their
guidance in preparing proper documentation for our project.

Abstract
Social networks now play an important role in expressing people’s
sentiments and their interest in particular fields. Extracting a
user’s public social network data (what the user shares with friends
and relatives and how the user reacts to others’ thoughts) amounts to
extracting the user’s behavior. If, under a set of defined hypotheses,
a machine can be made to understand human sentiment and interest, it
becomes possible to recommend items that match a user’s personal
interests on the basis of the sentiment analyzed by the machine. Our
main approach is to make suggestions to a user regarding the specific
interests anticipated by analyzing the user’s public data. This can be
extended to business analysis, suggesting products or services of
different companies according to the consumer’s personal choices. Such
automation can also help to choose the right candidates for a
questionnaire, and it can help people learn about themselves and how
their behavior may influence others. It is possible to identify
different types of people, such as dependable people, people with
leadership skills, people with a supportive mentality and people with a
negative mentality.

Table of Contents
List of Figures
1. Introduction
2. Previous Work
2.1 Location Based Social Network
2.2 Collaborative Recommendation Based Social Network
2.3 Sentiment Intensity Analysis of Informal Texts
2.4 Big Five [1] Modeling
3. Research Questions
4. Proposed Research Methodology
4.1 Data Collection
4.2 Data Analysis
4.3 Results
4.4 Recommendation Analysis
5. Conclusion

List of Figures
4.1 Modeling User Behavior
4.2 Pie Chart of LIWC Results
4.3 Personality Based Recommendation System

List of Tables
2.1 Comparison of different location based social networks
4.1 Relationship between LIWC categories and Big Five factors
4.2 Products under Big Five factors

Chapter 1
Introduction
With millions of users, social networking services like Facebook [2] and Twitter [3]
have become some of the most popular internet applications. These applications
are sources of knowledge and information. The rich knowledge accumulated in
these social networking sites enables a variety of recommendation
systems for new users and media [4]. To exploit this opportunity, it is possible to
create an automated system that categorizes social network users according to
the Big Five [1] personality factors. To categorize users in this way, their
data must be collected without interfering with their daily activities.
Such a system helps people learn about other people. For example, suppose an
employee needs a vacation and his boss is listed as a friend on an OSN (Online
Social Network). The employee can then time his request according to the
boss’s behavior as determined by the system (a high Neuroticism [1] score
indicates a higher chance of refusal, whereas a high Agreeableness [1] score
indicates a higher chance of approval).
OSN deal with big data; after analyzing such data, the
system can predict which people are suitable for leadership and which may
oppose it. Many challenges to recommendation systems have been
tackled by new approaches that use different data sources and methodologies
to generate different kinds of recommendations. In this thesis we provide a
description of such systems.
From the very beginning, consumer interests have had a great influence on business
policy. Offering the right products or services to the right customers is the main
objective of every successful business policy. Many business organizations can
benefit from the data collected from OSN. At present the popularity of
social networks is increasing very rapidly. From a sociologist’s point of view, OSN
can be characterized as “collective goods produced through computer mediated
collective action” [5]. Users spend a large part of their daily lives
on OSN and share a lot of information about themselves and their friends
and families. This is therefore a great opportunity to learn about the sentiments
and interests of people. Understanding the behavior of OSN users
is a crucial factor for advertising policies and better product
design.
In particular, given the success of item recommendation systems on commercial
websites such as Amazon [6] and Netflix [7], it is considered worthwhile to revisit
the recommendation problem through the perspective of social networking. In
general, recommendation systems aim to provide personalized recommendations
of items to users based on their previous behavior as well as on other information
gathered from item descriptions and user profiles.
Our experiment is based on Twitter [3] and Facebook [2], the most popular OSN
websites, both of which carry a large amount of advertising. These websites have
a very large number of users, who feel comfortable using them
because of user-friendly features such as micro-blogging, status
updates, photo and video sharing, commenting on posts, joining and creating
groups, liking and subscribing to pages and profiles, creating events, playing games
and so on.
We aim to analyze user behavior in the following steps: collecting the user’s
past activities on OSN, mapping them onto the Big Five factors [1], identifying a
set of particular fields of interest for the user, and making recommendations to
him or her in the form of informative services.

Chapter 2
Previous work
Online social networking is the practice of expanding the number of business and
social contacts of a person by making connections through individuals [8]. In this
era of the internet, OSN are extremely popular. According to a Nielsen Online
report, two thirds of the world population spend 10% of their internet time on
OSN [9]. Because OSN give users the opportunity to express what they want to say
to their friends, relatives and others connected through their OSN accounts, there
are many opportunities to identify and characterize a person’s behavior implicitly,
without interfering with his or her personal life [4].
2.1 Location Based Social Network [10]
A social network is a social structure made up of individuals connected by one or
more specific types of interdependency, such as friendship, common interests, and
shared knowledge. Generally, a social networking service builds on and reflects
the real-life social networks among people through online platforms such as a
website, providing ways for users to share ideas, activities, events, and interests
over the Internet. The increasing availability of location-acquisition technology
(for example GPS and Wi-Fi) empowers people to add a location dimension to
existing online social networks in a variety of ways. For example, users can upload
location-tagged photos to a social networking service such as Flickr [11], comment
on an event at the exact place where the event is happening (for instance, in
Twitter [3]), share their present location on a website (such as Foursquare [12]) for
organizing a group activity in the real world, or record travel routes with GPS
trajectories to share travel experiences in an online community. Here, a location
can be represented in absolute (latitude-longitude coordinates), relative (100 meters
north of the Space Needle), and symbolic (home, office, or shopping mall) form.
Also, the location embedded into a social network can be a stand-alone instant
location of an individual, such as being in a bar at 9pm, or a location history
accumulated over a certain period, such as a GPS trajectory: a cinema → a
restaurant → a park → a bar.
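The three location forms and the distinction between an instant location and a location history can be illustrated with a small data structure. The following is only a minimal sketch: the class names `Location` and `Visit`, the helper `location_history`, and the sample timestamps are hypothetical and do not come from [10].

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Location:
    """One location, in any of the three forms described in the text."""
    absolute: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    relative: Optional[str] = None  # e.g. "100 meters north of the Space Needle"
    symbolic: Optional[str] = None  # e.g. "home", "office", "shopping mall"


@dataclass
class Visit:
    """An instant location: where an individual is at a given timestamp."""
    location: Location
    timestamp: datetime


def location_history(visits: List[Visit]) -> List[str]:
    """Return the symbolic labels of the visits in chronological order."""
    ordered = sorted(visits, key=lambda v: v.timestamp)
    return [v.location.symbolic or "unknown" for v in ordered]


# Hypothetical visits, deliberately stored out of order.
visits = [
    Visit(Location(symbolic="bar"), datetime(2014, 6, 1, 21, 0)),
    Visit(Location(symbolic="cinema"), datetime(2014, 6, 1, 14, 0)),
    Visit(Location(symbolic="restaurant"), datetime(2014, 6, 1, 17, 0)),
    Visit(Location(symbolic="park"), datetime(2014, 6, 1, 19, 0)),
]
print(" -> ".join(location_history(visits)))
```

A single `Visit` corresponds to the instant location, while the ordered list returned by `location_history` corresponds to the accumulated location history.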
The dimension of location brings social networks back to reality, bridging the
gap between the physical world and online social networking services. For example,
a user with a mobile phone can leave his/her comments about a
restaurant on an online social site (after finishing dinner) so that the people from
his/her social structure can refer to these comments when they later visit the
restaurant. In this example, users create their own location-related stories in the
physical world and browse other people’s information as well. An online social site
becomes a platform for facilitating the sharing of people’s experiences. Furthermore,
people in an existing social network can expand their social structure with
the new interdependency derived from their locations. As location is one of the
most important components of user context, extensive knowledge about an
individual’s interests and behavior can be learned from his or her locations. For
instance, people who enjoy the same restaurant can connect with each other.
Individuals who regularly hike the same mountain can be put in contact with each
other to share their travel experiences. Sometimes, two individuals who do not share
the same absolute location can still be linked as long as their locations are indicative
of a similar interest, such as beaches or lakes.
These kinds of location-embedded and location-driven social structures are known
as location-based social networks, formally defined as follows:
“A location-based social network (LBSN) [10] does not only mean adding a location
to an existing social network so that people in the social structure can share
location embedded information, but also consists of the new social structure made
up of individuals connected by the interdependency derived from their locations in
the physical world as well as their location-tagged media content, such as photos,
video, and texts. Here, the physical location consists of the instant location of an
individual at a given timestamp and the location history that an individual has
accumulated in a certain period. Further, the interdependency includes not only
that two persons co-occur in the same physical location or share similar location
histories but also the knowledge, e.g., common interests, behavior, and activities,
inferred from an individual’s location (history) and location-tagged data.”
In a location-based social network, people can not only track and share the
location-related information of an individual via either mobile devices or desktop
computers, but also leverage collaborative social knowledge learned from
user-generated and location-related content, such as GPS trajectories and geo-tagged
photos. One example is determining this summer’s most popular restaurant by mining
people’s geo-tagged comments. Another example could be identifying the most
popular travel routes in a city based on a large number of users’ geo-tagged
photos. Consequently, LBSNs enable many novel applications that change the way
we live, such as physical location (or activity) recommendation systems [13] [14]
and travel planning, while offering many new research opportunities for social
network analysis (like user modeling in the physical world and connection strength
analysis) [15] [16], spatio-temporal data mining [17], ubiquitous computing [18],
and spatio-temporal databases [17] [19].
Existing applications providing location-based social networking services can be
broadly divided into three categories: geo-tagged-media-based,
point-location-driven and trajectory-centric.
• Geo-tagged-media-based. [10] Quite a few geo-tagging services enable users
to add a location label to media content such as text, photos, and videos
generated in the physical world. The tagging can occur instantly when
the medium is generated, or after a user has returned home. In this way,
people can browse their content at the exact location where it was created
(on a digital map or in the physical world using a mobile phone). Users can
also comment on the media and expand their social structures using the
interdependency derived from the geo-tagged content (for example, favoring
the same photo taken at a location). Representative websites of such
location-based social networking services include Flickr, Panoramio, and
Geo-twitter. Though a location dimension has been added to these social
networks, the focus of such services is still on the media content. That is,
location is used only as a feature to organize and enrich media content while
the major interdependency between users is based on the media itself.
• Point-location-driven. [10] Applications like Foursquare and Google Latitude
encourage people to share their current locations, such as a restaurant
or a museum. In Foursquare, points and badges are awarded for checking
in at venues. The individual with the most check-ins at a venue
is crowned Mayor. With the real-time location of users, an individual can
discover friends (from her social network) around her physical location so as
to enable certain social activities in the physical world, e.g., inviting people
to have dinner or go shopping. Meanwhile, users can add tips to venues
that other users can read, which serve as suggestions for things to do, see,
or eat at the location. With this kind of service, a venue (point location) is
the main element determining the interdependency connecting users, while
user-generated content such as tips and badges features a point location.
• Trajectory-centric. [10] In a trajectory-centric social networking service,
such as Bikely, SportsDo, and Microsoft GeoLife, users pay attention to
both point locations (passed by a trajectory) and the detailed route
connecting these point locations. These services not only tell users basic
information, such as distance, duration, and velocity, about a particular
trajectory, but also show a user’s experiences represented by tags, tips, and
photos for the trajectory. In short, these services provide “how” and “what”
information in addition to “where” and “when”. In this way, other people can
refer to a user’s travel/sports experience by browsing or replaying the
trajectory on a digital map, and follow the trajectory in the real world with a
GPS-phone.

Table 2.1 provides a brief comparison of these three services. The major
differences between the point-location-driven and the trajectory-centric LBSN lie
in two aspects. One is that a trajectory offers richer information than a point
location, such as how to reach a location, how long a user
stayed in a location, the time needed to travel between two locations, and the
physical/traffic conditions of a route. As a result, we are more likely to accurately
understand an individual’s behavior and interests in a trajectory-centric LBSN.
The other is that in a point-location-driven LBSN users usually share their
real-time location, while a trajectory-centric LBSN more likely delivers historical
locations, as users typically prefer to upload a trajectory after a trip has finished
(though it can also be uploaded continuously). This property could
compromise some scenarios based on the real-time location of a user; however, it
reduces to some extent the privacy issues in a location-based social network. In
other words, when people see a user’s trajectory the user is no longer there.
Table 2.1. Comparison of different location based social networks
LBSN service             Focus            Real-time         Information
Geo-tagged-media-based   Media            Normal            Poor
Point-location-driven    Point location   Instant           Normal
Trajectory-centric       Trajectory       Relatively slow   Rich
In fact, the location data generated in the first two LBSN services can be
converted into the form of a trajectory which might be used by the third category
of LBSN service. For example, if we sequentially connect the point locations of
the geo-tagged photos taken by a user over several days, a sparse trajectory can be
formed. Likewise, the check-in records of an individual ordered by time can
be regarded as a low-sampling-rate trajectory. However, such trajectories are sparse:
the distance and time interval between two consecutive points
can be very large, which increases the uncertainty of a single trajectory from the
first two services. To use these trajectories in trajectory-centric
LBSN services, we need to use them in a collective and collaborative way.
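The conversion of check-in records into a low-sampling-rate trajectory, and the sparseness that results, can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [10]: the function names `to_trajectory` and `gaps`, the `(timestamp, latitude, longitude)` record layout and the sample coordinates are all hypothetical. Distances use the standard haversine great-circle formula.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def to_trajectory(checkins):
    """Order (timestamp, lat, lon) check-ins by time to form a sparse trajectory."""
    return sorted(checkins, key=lambda c: c[0])


def gaps(trajectory):
    """Return (hours, km) between each pair of consecutive trajectory points."""
    result = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(trajectory, trajectory[1:]):
        hours = (t2 - t1).total_seconds() / 3600.0
        result.append((hours, haversine_km((lat1, lon1), (lat2, lon2))))
    return result


# Hypothetical check-ins of one user over three days, in arbitrary order.
checkins = [
    (datetime(2014, 6, 3, 18, 30), 23.8103, 90.4125),
    (datetime(2014, 6, 1, 9, 0), 23.7949, 90.4043),
    (datetime(2014, 6, 2, 13, 15), 23.7806, 90.4193),
]
trajectory = to_trajectory(checkins)
for hours, km in gaps(trajectory):
    print(f"{hours:.1f} h and {km:.1f} km between consecutive points")
```

Gaps of more than a day between consecutive points, as in this sample, are what makes a single such trajectory uncertain: nothing is known about where the user was in between.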
Trajectory data is the most complex data structure to be found in the three
LBSN services, and provides the richest information. If it is handled well, other
data sources become easier to deal with. Moreover, as mentioned above, location
data can be converted into a trajectory on many occasions. Consequently,
some methodologies designed for trajectory data can be employed by the first two
LBSN services.
2.2 Collaborative Recommendation Based Social Network [20]
With the recent advances in technology, there is an emerging presence of social
media and social networking systems. In the case of multimedia enriched social
network systems, such as last.fm, the collective goods are musical tracks and the
collective action is the process of crafting individual profiles of musical preference
and linking them either explicitly, via bonds of friendship, or implicitly, through
collaborative annotation.
This collective action leads to the creation of an implicit social networking
structure, which we aim to further explore. In particular, given the success of item
recommendation systems in commercial websites, such as Amazon.com and
Netflix, it is considered worthwhile to revisit the recommendation problem through
the novel perspective of social networking. In general, recommendation systems
aim to provide personalized recommendations of items to users based on their
previous behavior as well as on other information gathered from item descriptions
and user profiles.
However, no emphasis has been placed yet on personalization based explicitly on
social networks. The reason is that, despite the increasing interest in the
exploration of social networks, there is no concrete dataset that
includes both explicit bonds of friendship among users and free-form collaborative
annotation of items. This is because most social media systems do not allow
free access to all user profiles or lists of friends.
Given the incentives of the widespread adoption of social networks and of the
lack of previous studies that directly address the problem of efficiently
integrating the added-value knowledge provided by those networks into the field of
collaborative recommendation, Kontas et al. [20] propose a new methodology that
tackles the aforementioned issues. Within this context they make the following
contributions:
• Kontas et al. [20] introduce a dataset based on data from the last.fm social
network that describes a social graph among users, tracks and tags,
effectively including bonds of friendship and collaborative annotation.
• Kontas et al. [20] evaluate a Random Walk with Restarts (RWR) model
on this dataset and show that the incorporation of friendship and social
tagging can improve the performance of an item recommendation system.
• Kontas et al. [20] show that the RWR method outperforms the standard
Collaborative Filtering (CF) method, which they also evaluate against the
same dataset.
• Kontas et al. [20] show that the RWR method requires
no training and successfully manages to capture
Kontas et al. [20] distinguish two broad categories of recommendation
systems, namely content-based and collaborative filtering. A content-based
system selects items based on the correlation between the content of the
items (e.g. keywords describing the items, such as album genre, artists, etc., for
music tracks) and the users’ preferences [5]. However, it is limited to
dictionary-bound relations between the keywords used by users and the descriptions
of items and therefore does not explore implicit associations between users.
Collaborative filtering systems are divided into two categories, i.e. memory-based
and model-based. In memory-based systems [21] the
similarity between all users is calculated from their ratings of items using some
heuristic measure such as the cosine similarity or the Pearson correlation score.
A missing rating is then predicted by aggregating the ratings of the k nearest
neighbors of the user we want to recommend to. The problem with memory-based
systems is that parameters such as the number of neighbors have to be chosen on a
rather arbitrary basis. What is more, in the case of social networks there is no
straightforward way to introduce similarities between users based on friendships
and social tagging, other than some ad hoc interpolation of similarity
weights from those different sources.
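The memory-based procedure can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in [20] or [21]: the function names, the similarity-weighted average used as the aggregation step, and the toy ratings are all assumptions made for this example. Unrated items are treated as zeros when computing the cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Predict a missing rating from the k most similar users who rated the item."""
    candidates = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbors = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in neighbors)
    if total_similarity == 0:
        return None  # no usable neighbor, so no prediction
    # Similarity-weighted average of the neighbors' ratings.
    return sum(sim * rating for sim, rating in neighbors) / total_similarity


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"track_a": 5, "track_b": 3},
    "bob": {"track_a": 5, "track_b": 3, "track_c": 4},
    "carol": {"track_a": 1, "track_b": 5, "track_c": 2},
    "dave": {"track_b": 4, "track_c": 5},
}
print(predict_rating(ratings, "alice", "track_c", k=2))
```

The choice of `k` in the call above is exactly the kind of arbitrary parameter the text refers to: with `k=1` only the most similar user is used, and the prediction changes accordingly.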
The model-based filtering systems assume that users form clusters based
on similar behavior in rating items. A model is learned from patterns
recognized in the rating behavior of users using clustering, Bayesian networks
and other machine learning techniques [22] [23]. The problem with model-based
methods is that several parameters of the model must be fine-tuned and
that the models produced might not generalize well to radically
different contexts. What is more, as in the case of memory-based systems, extra
effort and training are needed in order to introduce knowledge from social
networks.
Many recent research publications revolve around the area of social
media. In particular, several studies focus on dataset collection and analysis
from social networks. Das et al. [24] proposed sampling-based algorithms that
capture information in the neighborhood of a user in dynamic social networks
utilizing random walks. Halpin et al. [25] studied the distribution of tags in
the social bookmarking site del.icio.us and proposed a generative model of
collaborative tagging in order to evaluate the dynamics that lie beneath the act of
collaborative recommendation. Their findings show that the dataset collected
follows a power-law distribution. Even though both studies examine social networks
that are based on social tagging, they do not explore the dynamics of friendships
among users. Taking into account the power of free-form tagging of items by users
other than their authors/owners, researchers also focus on tag recommendation.
Subramanya and Liu [26] propose a system that automatically recommends tags
for blogs, using similarity ranking in a manner similar to collaborative filtering
techniques. Stromhaier [27] studies a novel idea in tag recommendation, which
bridges the gap between the keywords issued by a user in a query and the tags
actually used by a social system. He argues that the tags used by a user when
performing a query exhibit his or her intent, whereas the annotations of items
describe content semantics. As a result, he proposes a new form of purpose tags,
which extract the intent of the user and facilitate goal-oriented search in a social
network. Both studies underline the importance and discriminative power of social
tagging, which is also validated by our work.
Several studies exist in the field of applying Random Walks on bipartite
graphs. Craswell and Szummer [28] study a clickthrough data graph in order
to perform item recommendation. Nevertheless, no social content is available
between users. Yildirim and Krishnamoorthy [23] propose a novel recommendation
algorithm which performs Random Walks on a graph that denotes similarity
measures between items. They evaluate their system using data from MovieLens.
Although the Random Walk model performs well in the context of
recommendation, their use of an Item-Item similarity matrix raises some issues
as to the ability of the system to extend when other similarities are introduced
based on social tagging. Recent work has also been done in the field of applying
Random Walks over a social graph instead of bipartite graphs, similar to what
we propose in this paper. Clements et al. [29] propose a single term query system
performing Random Walks on graphs including users, items and tags. They use
data from LibraryThing, an online book catalogue where users rate and tag books
they have read. Due to the lack of ground truth, they assume that the tags assigned
to an item by each user are the same as those the user would use as query terms to
retrieve the annotated item. We argue that this assumption is rather strong and that
a user experiment would be more appropriate in order to properly establish the
ground truth.
Hotho et al. evaluate a variation of adapted PageRank on a dataset from del.icio.us,
exploring folksonomies of bookmarks based also on collaborative annotation [30].
However, since they evaluate their proposed algorithm empirically, any attempt
to compare against their results becomes cumbersome. Although both studies are
close to our approach, we use a different model, namely RWR, in which we
explicitly include friendships in our dataset and perform collaborative
recommendations instead of queries on the graph.
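The general idea of a Random Walk with Restarts over a graph of users, tracks and tags can be sketched as follows. This is only an illustrative sketch of the generic RWR iteration, not the model or parameters used in [20]: the toy graph, the node naming scheme, the restart probability of 0.15, the fixed iteration count and the function names are all assumptions made for this example.

```python
def random_walk_with_restarts(graph, start, restart_prob=0.15, iterations=100):
    """Approximate the probability of a walker being at each node.

    At every step the walker jumps back to `start` with probability
    `restart_prob`; otherwise it moves to a uniformly chosen neighbor.
    """
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            if not neighbors:
                # A node without neighbors sends its mass back to the start.
                updated[start] += (1.0 - restart_prob) * mass
                continue
            share = (1.0 - restart_prob) * mass / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        updated[start] += restart_prob  # total probability mass stays 1
        scores = updated
    return scores


def recommend_tracks(graph, user, top_n=2):
    """Rank tracks the user is not yet linked to by their RWR score."""
    scores = random_walk_with_restarts(graph, user)
    linked = set(graph[user])
    candidates = [n for n in graph if n.startswith("track:") and n not in linked]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)[:top_n]


# Hypothetical undirected graph: friendships, listened tracks, and tags.
graph = {
    "user:ann": ["user:ben", "track:t1"],
    "user:ben": ["user:ann", "track:t2", "track:t3"],
    "track:t1": ["user:ann", "tag:rock"],
    "track:t2": ["user:ben", "tag:rock"],
    "track:t3": ["user:ben", "tag:jazz"],
    "tag:rock": ["track:t1", "track:t2"],
    "tag:jazz": ["track:t3"],
}
print(recommend_tracks(graph, "user:ann"))
```

In this toy graph, `track:t2` is reachable from `user:ann` both through her friend and through the tag it shares with a track she already has, which is why friendship and tagging links together raise its score above that of `track:t3`.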

2.3 Sentiment Intensity Analysis of Informal Texts [31]
The proliferation of social networks such as blogs, forums and other online means
of expression and communication has resulted in a landscape where people are
able to freely discuss online through a variety of means and applications.
Probably one of the most novel and interesting ways of communicating in
cyberspace is through 3D virtual environments. In such environments, people,
represented by their avatars, socialize and interact with each other and with virtual
humans operated by machines, i.e., computer systems.
Despite the fact that the graphics of those environments remain relatively poor,
futuristic movies such as Avatar [32] provide an example of sophisticated
landscapes and renderings that will be attainable by such environments in the
foreseeable future. However, regardless of how attractive and realistic such artificial
3D worlds become, they will always remain heavily dependent on the quality of
human communication that takes place within them. As shown in [33] [34] [35],
communication in environments that are not limited to a single textual modality
consists of not just semantic data transfer, but also of dense non-verbal
communication where sentiment plays an important role. Moreover, without emotion
no consistent and coherent (virtual) body language is possible. Such primordial
movements include facial expressions, eye looks, arm-language coordination, etc.
Sentiment detection from textual utterances can play an important role in the
development of realistic and interactive dialog systems. Such systems serve
various educational, business or entertainment oriented functions and also include
systems that are deployed in 3D virtual environments. With the aid of “dialog
coherence” modules, conversational systems aim at a realistic interaction flow at
the emotional level, e.g., Affect Listeners [36], and can greatly benefit from the
correct identification of the emotional state of their participants. Taking into
consideration that the majority of input to practical conversational systems
consists of short, informal, textual exchanges, it is essential that the sentiment
analysis component integrated in the dialog system is able to cope with this
informal, often incomplete or ill-formed type of communication.
Sentiment analysis, the process of automatically detecting if a text segment
contains emotional or opinionated content and extracting its polarity or valence, is
a field of research that has received significant attention in recent years, both in
academia and in industry. The aforementioned increase of user-generated content
on the web has resulted in a wealth of information that is potentially of vital
importance to institutions and companies, providing them with data to research
their consumers, manage their reputations and identify new opportunities. As
a result, most of the research in the field has been limited to product reviews,
where the aim is to predict whether the reviewer recommends a product or not,
based on the textual content of the review.
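One simple way to extract the polarity of a text segment is to look its words up in a sentiment lexicon and sum the scores. The sketch below illustrates that idea only; it is not the classifier of [31]. The tiny word list, its scores, the negation rule and the function name `polarity` are hypothetical choices made for this example.

```python
import re

# Hypothetical miniature lexicon: positive words score > 0, negative words < 0.
LEXICON = {"good": 2, "great": 3, "love": 3, "bad": -2, "terrible": -3, "hate": -3}
NEGATIONS = {"not", "never", "no"}


def polarity(text):
    """Classify text as positive, negative or neutral from summed word scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # flip the sign of the next sentiment word
            continue
        value = LEXICON.get(token, 0)
        if value != 0:
            score += -value if negate else value
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for message in ["I love this phone", "I do not love this phone", "see you at 9pm"]:
    print(message, "=>", polarity(message))
```

A lexicon approach of this kind needs no labeled training data, which matters for informal exchanges where no explicit rating accompanies the text; its obvious weakness is that misspelled or unlisted words contribute nothing to the score.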
The focus of this paper is different. Instead of focusing our attention on product
reviews, we explore a more ubiquitous field of informal, social interactions in
cyberspace. The unprecedented popularity of social platforms such as Facebook,
Twitter, MySpace as well as 3D virtual worlds has resulted in an unparalleled
increase of textual exchanges that remains relatively unexplored, especially in terms
of its emotional content.
Specifically, Paltoglou et al. [31] aim to answer the following question: can lexicon-
based approaches perform more effectively than machine-learning approaches in
this domain? This question is particularly important, because previous research
in sentiment analysis using product reviews has shown that machine-learning
approaches typically outperform lexicon-based ones, but no exploration of whether
the same holds for informal, social interactions has been carried out in the past. The
differences between the two domains are numerous. Firstly, reviews tend to be
longer and more verbose than typical social interactions, which may only be a
few words long and often contain significant spelling errors [37]. Secondly, no
clear “gold standard” exists in the domain of informal communications with
which to train a machine-learning classifier, in contrast to the “thumbs up” or
“thumbs down” feature of reviews. Lastly, social exchanges on the web tend to
be much more diverse in terms of their topics, with issues ranging from politics
and recent news to religion, while product reviews by definition have
a specific subject, i.e. the product under discussion. The study of emotional
and social interactions in virtual worlds implies the study of virtual human (VH)
behaviors. Two types of VH exist: avatars (i.e. the projection of a real human in
the 3D environment) and agents (i.e. the projection of an autonomous machine
simulating a human in the virtual world). These VH types result in three possible
types of communications: avatar to avatar, agent to agent and avatar to agent.
Each one of those has the following interesting aspects respectively:
- A non-verbal body language based on VH emotional states and mind profile.
- A potential visualization of the interaction from a third VH that should be
represented by an avatar.
- A non-verbal communication for the human representation and an action of
the agent strongly influenced by interpreted emotions from the avatar.
It seems only logical that artificial intelligence and conversation systems would
strongly benefit these aspects in order to make the communication more
realistic. The structure of this paper is as follows. The next section provides
a brief overview of relevant work in sentiment analysis. Section 3 presents
the lexicon-based classifier and section 4 presents the two machine-learning
classifiers that will be used in this study. Section 5 describes the data sets
that were used and explains the experimental setup, while section 6 presents
and analyzes the results.
Finally, Paltoglou et al. [31] conclude and present some potential future directions
of research. Sentiment analysis, also known as opinion mining, has attracted
considerable interest recently. Most research has focused on analyzing the content
of either movie or general product reviews (e.g. [38]). Attempts to expand the
application of sentiment analysis to other domains, such as debates [39], news and
blogs [40] are also prominent. The seminal book of Pang and Lee [41] presents a
thorough analysis of the work in the field. In this section we will focus on the more
prominent work which is relevant to our approach. Pang et al. [46] were amongst
the first to explore the sentiment analysis of reviews, focusing on machine-learning
approaches. These approaches generally function as follows: initially, a
general inductive process learns the characteristics of a class during a training
phase, by observing the properties of a number of pre-classified documents (i.e.
a reference corpus), and then applies the acquired knowledge to determine the best
category for new, unseen documents during testing. Pang et al. [46] experimented
with three different algorithms: Support Vector Machines (SVMs), Naive Bayes
and Maximum Entropy classifiers, using a variety of features, such as unigrams
and bigrams, part-of-speech tags, binary and term frequency feature weights and
others. Their best accuracy on a dataset consisting of movie reviews was
attained using an SVM classifier with binary features, although all three classifiers
gave very comparable performance. Other approaches (e.g. [42] [43]) have focused
on extending the feature set with semantically or linguistically-driven features
in order to improve classification accuracy. Dictionary/lexicon-based sentiment
analysis is typically based on lists of words with some sort of pre-determined
emotional weight. Examples of such dictionaries include the General Inquirer
(GI) dictionary [44] and the “Linguistic Inquiry and Word Count” (LIWC) software
[45], which are also used in the present study. Both lexicons are built with
the aid of experts that classify certain tokens in terms of their affective content
(e.g. positive or negative). The “Affective Norms for English Words” (ANEW)
lexicon [46] contains ratings of terms on a nine-point scale in regard to three
individual dimensions: valence, arousal and dominance. The ratings were pro-
duced manually by psychology class students. Ways to produce such emotional
dictionaries in an automatic or semi-automatic fashion have also been introduced
in research [47]. Emotional dictionaries have mostly been utilized in psychology
or sociology oriented research [48].
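The dictionary-based approach described above can be illustrated with a minimal sketch. The word list and weights below form a hypothetical toy lexicon and are not entries taken from GI, LIWC or ANEW; the function name lexicon_polarity is likewise our own. The sketch simply sums the weights of the lexicon words found in a text and reads the sign of the total as the polarity.

```python
import re

# Hypothetical toy lexicon: word -> emotional weight (positive > 0, negative < 0).
# Real lexicons such as GI, LIWC or ANEW are much larger and are not reproduced here.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Return 'positive', 'negative' or 'neutral' from summed word weights."""
    # Lowercase the text and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute nothing to the score.
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("I love this, great!"))
print(lexicon_polarity("awful, I hate it"))
```

A production classifier would also need to handle negation, intensifiers and misspellings, which this sketch deliberately leaves out.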
The idea of emotional conversationalists is relatively old. First attempts to create
such a system can be traced back to Parry [49], a chatterbot intended for studying
the nature of paranoia and able to express fears, anxieties or beliefs. More recent
work includes research on the development of synthetic characters and chatterbots
with personalities [50] and studies on emotional responses and their influence on
the creation of believable agents or interactive virtual personalities [51]. In [52]
the authors focused on the role of emotions for gaining rapport in spoken dialog
systems by rendering responses that contain suitable emotion, both lexically and
auditorily. Studies on the role of facial expressions in building rapport in virtual
human-user interactions were conducted in [53]. A chatterbot system that
generates emotional responses by selecting and displaying expressive images of the
character emulated by the chatterbot was presented in [54]. Emotional
communication for virtual worlds has been a challenging research field for almost
two decades. One of the pioneering papers was proposed by Cassell et al. [55].
In the proposed system, conversations between multiple human-like agents were
automatically generated and animated with appropriate and synchronized speech,
intonation, facial expressions, and hand gestures. Numerous ways to
design personality and emotion models for virtual humans have since been proposed.
More recently, a specific personality and emotional states were predicted from
hierarchical fuzzy rules to facilitate personality and emotion control, and in 2009,
Pelachaud et al. [56] developed a model of behavior expressivity using a set of six
parameters that act as
modulation of behavior animation. Finally, this year, [35] introduced a graphical
representation of human emotion extracted from text sentences. The main con-
tributions of that approach included an original pipeline that extracts, processes,
and renders emotion of 3D VH. Additionally, the paper presented methods to
optimize the computational pipeline so that real time virtual reality rendering
can be achieved on common PCs. Lastly, it was demonstrated how the Poisson
distribution can be utilized to transfer database extracted lexical and language
parameters into coherent intensities of valence and arousal (i.e. parameters of
Russell’s circumplex model of emotion).
2.4 Big Five modeling [1]
At present, many researchers believe that there are five core personality traits,
and the evidence for this theory has been growing over the past 50 years [1]. From
the point of view of a sociologist, social media can be characterized as collective
goods produced through computer-mediated collective action [57]. The five factors
are Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness [58].
People in each category have different attitudes toward sites, different tastes in
products, and different skills for accomplishing work. People in different
categories also have different ways of expressing their thoughts, and OSN users
attach different levels of significance to expressing their thoughts or behavior [1] [4].
The users of OSN can be categorized according to the Big Five factors. The behavior
of an OSN user varies from location to location, but people from the same or
nearby locations tend to behave similarly [59]. Behavior also varies among people
of different ages.
The personality traits used in the five-factor model are Extraversion, Agreeableness,
Conscientiousness, Neuroticism and Openness to Experience [58]. It is important
to ignore the positive or negative associations that these words have in everyday
language. For example, Agreeableness is obviously advantageous for achieving
and maintaining popularity. Agreeable people are better liked than disagreeable
people. On the other hand, agreeableness is not useful in situations that require
tough or totally objective decisions. Disagreeable people can make excellent
scientists, critics, or soldiers. Remember, none of the five traits is in itself
positive or negative; they are simply characteristics that individuals exhibit to a
greater or lesser extent.
Each of these 5 personality traits describes, relative to other people, the frequency
or intensity of a person’s feelings, thoughts, or behaviors. Everyone possesses all
5 of these traits to a greater or lesser degree. For example, two individuals could
be described as agreeable (agreeable people value getting along with others). But
there could be significant variation in the degree to which they are both agree-
able. In other words, all 5 personality traits exist on a continuum rather than as
attributes that a person does or does not have.
Each of the Big Five personality traits is made up of 6 facets or sub-traits. These
can be assessed independently of the trait that they belong to.
• Extraversion
Extraversion is marked by pronounced engagement with the external world.
Extraverts enjoy being with people, are full of energy, and often experience
positive emotions. They tend to be enthusiastic, action-oriented, individu-
als who are likely to say “Yes!” or “Let’s go!” to opportunities for excite-
ment. In groups they like to talk, assert themselves, and draw attention to
themselves. Introverts lack the exuberance, energy, and activity levels of
extraverts. They tend to be quiet, low-key, deliberate, and disengaged from
the social world. Their lack of social involvement should not be interpreted
as shyness or depression; the introvert simply needs less stimulation than
an extravert and prefers to be alone. The independence and reserve of the
introvert is sometimes mistaken as unfriendliness or arrogance. In reality,
an introvert who scores high on the agreeableness dimension will not seek
others out but will be quite pleasant when approached.
Extraversion Facets:
– Friendliness. Friendly people genuinely like other people and openly
demonstrate positive feelings toward others. They make friends quickly
and it is easy for them to form close, intimate relationships. Low scor-
ers on Friendliness are not necessarily cold and hostile, but they do
not reach out to others and are perceived as distant and reserved.
– Gregariousness. Gregarious people find the company of others
pleasantly stimulating and rewarding. They enjoy the excitement of crowds.
Low scorers tend to feel overwhelmed by, and therefore actively avoid,
large crowds. They do not necessarily dislike being with people some-
times, but their need for privacy and time to themselves is much greater
than for individuals who score high on this scale.
– Assertiveness. High scorers on Assertiveness like to speak out, take charge,
and direct the activities of others. They tend to be leaders in groups.
Low scorers tend not to talk much and let others control the activities
of groups.
– Activity Level. Active individuals lead fast-paced, busy lives. They
move about quickly, energetically, and vigorously, and they are in-
volved in many activities. People who score low on this scale follow a
slower and more leisurely, relaxed pace.
– Excitement-Seeking. High scorers on this scale are easily bored with-
out high levels of stimulation. They love bright lights and hustle and
bustle. They are likely to take risks and seek thrills. Low scorers are
overwhelmed by noise and commotion and are averse to thrill-seeking.
– Cheerfulness. This scale measures positive mood and feelings, not neg-
ative emotions (which are a part of the Neuroticism domain). Persons
who score high on this scale typically experience a range of positive
feelings, including happiness, enthusiasm, optimism, and joy. Low
scorers are not as prone to such energetic, high spirits.
• Agreeableness
Agreeableness reflects individual differences in concern with cooperation
and social harmony. Agreeable individuals value getting along with others.
They are therefore considerate, friendly, generous, helpful, and willing to
compromise their interests with others’. Agreeable people also have an op-
timistic view of human nature. They believe people are basically honest,
decent, and trustworthy. Disagreeable individuals place self-interest above
getting along with others. They are generally unconcerned with others’
well-being, and therefore are unlikely to extend themselves for other peo-
ple. Sometimes their skepticism about others’ motives causes them to be
suspicious, unfriendly, and uncooperative. Agreeableness is obviously ad-
vantageous for attaining and maintaining popularity. Agreeable people are
better liked than disagreeable people. On the other hand, agreeableness is
not useful in situations that require tough or absolutely objective decisions.
Disagreeable people can make excellent scientists, critics, or soldiers.
Agreeableness Facets:
– Trust. A person with high trust assumes that most people are fair,
honest, and have good intentions. Persons low in trust may see others
as selfish, devious, and potentially dangerous.
– Morality. High scorers on this scale see no need for pretence or ma-
nipulation when dealing with others and are therefore candid, frank,
and sincere. Low scorers believe that a certain amount of deception in
social relationships is necessary. People find it relatively easy to relate
to the straightforward high-scorers on this scale. They generally find
it more difficult to relate to the low-scorers on this scale. It should be
made clear that low scorers are not unprincipled or immoral; they are
simply more guarded and less willing to openly reveal the whole truth.
– Altruism. Altruistic people find helping other people genuinely re-
warding. Consequently, they are generally willing to assist those who
are in need. Altruistic people find that doing things for others is a
form of self-fulfillment rather than self-sacrifice. Low scorers on this
scale do not particularly like helping those in need. Requests for help
feel like an imposition rather than an opportunity for self-fulfillment.
– Cooperation. Individuals who score high on this scale dislike con-
frontations. They are perfectly willing to compromise or to deny their
own needs in order to get along with others. Those who score low on
this scale are more likely to intimidate others to get their way.
– Modesty. High scorers on this scale do not like to claim that they are
better than other people. In some cases this attitude may derive from
low self-confidence or self-esteem. Nonetheless, some people with high
self-esteem find immodesty unseemly. Those who are willing to de-
scribe themselves as superior tend to be seen as disagreeably arrogant
by other people.
– Sympathy. People who score high on this scale are tender-hearted and
compassionate. They feel the pain of others vicariously and are easily
moved to pity. Low scorers are not affected strongly by human suf-
fering. They pride themselves on making objective judgments based
on reason. They are more concerned with truth and impartial justice
than with mercy.
• Conscientiousness
Conscientiousness concerns the way in which we control, regulate, and direct
our impulses. Impulses are not inherently bad; occasionally time constraints
require a snap decision, and acting on our first impulse can be an effective
response. Also, in times of play rather than work, acting spontaneously
and impulsively can be fun. Impulsive individuals can be seen by others as
colorful and fun-to-be-with.
Nonetheless, acting on impulse can lead to trouble in a number of ways.
Some impulses are antisocial. Uncontrolled antisocial acts not only harm
other members of society, but also can result in retribution toward the
perpetrator of such impulsive acts. Another problem with impulsive acts is
that they often produce immediate rewards but undesirable, long-term con-
sequences. Examples include excessive socializing that leads to being fired
from one’s job, hurling an insult that causes the breakup of an important
relationship, or using pleasure-inducing drugs that eventually destroy one’s
health.
Impulsive behavior, even when not seriously destructive, diminishes a per-
son’s effectiveness in significant ways. Acting impulsively disallows con-
templating alternative courses of action, some of which would have been
wiser than the impulsive choice. Impulsivity also sidetracks people during
projects that require organized sequences of steps or stages. Accomplish-
ments of an impulsive person are therefore small, scattered, and inconsis-
tent.
A hallmark of intelligence, what potentially separates human beings from
earlier life forms, is the ability to think about future consequences before
acting on an impulse. Intelligent activity involves contemplation of long-
range goals, organizing and planning routes to these goals, and persisting
toward one’s goals in the face of short-lived impulses to the contrary. The
idea that intelligence involves impulse control is nicely captured by the term
prudence, an alternative label for the Conscientiousness domain. Prudent
means both wise and cautious. Persons who score high on the Conscien-
tiousness scale are, in fact, perceived by others as intelligent.
The benefits of high conscientiousness are obvious. Conscientious individ-
uals avoid trouble and achieve high levels of success through purposeful
planning and persistence. They are also positively regarded by others as
intelligent and reliable. On the negative side, they can be compulsive perfec-
tionists and workaholics. Furthermore, extremely conscientious individuals
might be regarded as stuffy and boring. Unconscientious people may be
criticized for their unreliability, lack of ambition, and failure to stay within
the lines, but they will experience many short-lived pleasures and they will
never be called stuffy.
Conscientiousness Facets:
– Self-Efficacy. Self-Efficacy describes confidence in one’s ability to ac-
complish things. High scorers believe they have the intelligence (com-
mon sense), drive, and self-control necessary for achieving success. Low
scorers do not feel effective, and may have a sense that they are not in
control of their lives.
– Orderliness. Persons with high scores on orderliness are well-organized.
They like to live according to routines and schedules. They keep lists
and make plans. Low scorers tend to be disorganized and scattered.
– Dutifulness. This scale reflects the strength of a person’s sense of duty
and obligation. Those who score high on this scale have a strong sense
of moral obligation. Low scorers find contracts, rules, and regulations
overly confining. They are likely to be seen as unreliable or even
irresponsible.
– Achievement-Striving. Individuals who score high on this scale strive
hard to achieve excellence. Their drive to be recognized as successful
keeps them on track toward their lofty goals. They often have a strong
sense of direction in life, but extremely high scores may be too single-
minded and obsessed with their work. Low scorers are content to get
by with a minimal amount of work, and might be seen by others as
lazy.
– Self-Discipline. What many people call will-power refers to the ability
to persist at difficult or unpleasant tasks until they are completed.
People who possess high self-discipline are able to overcome reluctance
to begin tasks and stay on track despite distractions. Those with low
self-discipline procrastinate and show poor follow-through, often failing
to complete tasks, even tasks they want very much to complete.
– Cautiousness. Cautiousness describes the disposition to think through
possibilities before acting. High scorers on the Cautiousness scale take
their time when making decisions. Low scorers often say or do the first
thing that comes to mind without deliberating alternatives and the
probable consequences of those alternatives.
• Neuroticism
The term neurosis is used to describe a condition marked by mental distress,
emotional suffering, and an inability to cope effectively with the normal de-
mands of life. It is suggested that everyone shows some signs of neurosis,
but that we differ in our degree of suffering and our specific symptoms of
distress. Today neuroticism refers to the tendency to experience negative
feelings. Those who score high on Neuroticism may experience primarily
one specific negative feeling such as anxiety, anger, or depression, but are
likely to experience several of these emotions. People high in neuroticism
are emotionally reactive. They respond emotionally to events that would
not affect most people, and their reactions tend to be more intense than
normal. They are more likely to interpret ordinary situations as threaten-
ing, and minor frustrations as hopelessly difficult. Their negative emotional
reactions tend to persist for unusually long periods of time, which means
they are often in a bad mood. These problems in emotional regulation can
diminish a neurotic’s ability to think clearly, make decisions, and cope ef-
fectively with stress.
At the other end of the scale, individuals who score low in neuroticism are
less easily upset and are less emotionally reactive. They tend to be calm,
emotionally stable, and free from persistent negative feelings. Freedom from
negative feelings does not mean that low scorers experience a lot of positive
feelings; frequency of positive emotions is a component of the Extraversion
domain.
Neuroticism Facets:
– Anxiety. The “fight-or-flight” system of the brain of anxious individuals
is too easily and too often engaged. Therefore, people who are
high in anxiety often feel like something dangerous is about to happen.
They may be afraid of specific situations or be just generally fearful.
They feel tense, jittery, and nervous.
– Anger. Persons who score high in Anger feel enraged when things do
not go their way. They are sensitive about being treated fairly and
feel resentful and bitter when they feel they are being cheated. This
scale measures the tendency to feel angry; whether or not the person
expresses annoyance and hostility depends on the individual’s level on
Agreeableness. Low scorers do not get angry often or easily.
– Depression. This scale measures the tendency to feel sad, dejected,
and discouraged. High scorers lack energy and have difficulty initiating
activities. Low scorers tend to be free from these depressive feelings.
– Self-Consciousness. Self-conscious individuals are sensitive about what
others think of them. Their concern about rejection and ridicule causes
them to feel shy and uncomfortable around others. They are easily
embarrassed and often feel ashamed. Their fears that others will
criticize or make fun of them are exaggerated and unrealistic, but
their awkwardness and discomfort may make these fears a self-fulfilling
prophecy. Low scorers, in contrast, do not suffer from the mistaken
impression that everyone is watching and judging them. They do not
feel nervous in social situations.
– Immoderation. Immoderate individuals feel strong cravings and urges
that they have difficulty resisting. They tend to be oriented toward
short-term pleasures and rewards rather than long-term consequences.
Low scorers do not experience strong, irresistible cravings and conse-
quently do not find themselves tempted to overindulge.
– Vulnerability. High scorers on Vulnerability experience panic, confusion,
and helplessness when under pressure or stress. Low scorers feel
more poised, confident, and clear-thinking when stressed.
• Openness to Experience
Openness to Experience describes a dimension of cognitive style that dis-
tinguishes imaginative, creative people from down-to-earth, conventional
people. Open people are intellectually curious, appreciative of art, and
sensitive to beauty. They tend to be, compared to closed people, more
aware of their feelings. They tend to think and act in individualistic and
nonconforming ways. Intellectuals typically score high on Openness to Ex-
perience; consequently, this factor has also been called Culture or Intellect.
Nonetheless, Intellect is probably best regarded as one aspect of openness
to experience. Scores on Openness to Experience are only modestly related
to years of education and scores on standard intelligence tests.
Another characteristic of the open cognitive style is a facility for thinking in
symbols and abstractions far removed from concrete experience. Depending
on the individual’s specific intellectual abilities, this symbolic cognition
may take the form of mathematical, logical, or geometric thinking, artistic
and metaphorical use of language, music composition or performance, or
one of the many visual or performing arts. People with low scores on open-
ness to experience tend to have narrow, common interests. They prefer the
plain, straightforward, and obvious over the complex, ambiguous, and sub-
tle. They may regard the arts and sciences with suspicion, regarding these
endeavors as abstruse or of no practical use. Closed people prefer familiar-
ity over novelty; they are conservative and resistant to change. Openness
is often presented as healthier or more mature by psychologists, who are
often themselves open to experience. However, open and closed styles of
thinking are useful in different environments. The intellectual style of the
open person may serve a professor well, but research has shown that closed
thinking is related to superior job performance in police work, sales, and a
number of service occupations.
Openness to Experience Facets:
– Imagination. To imaginative individuals, the real world is often too
plain and ordinary. High scorers on this scale use fantasy as a way of
creating a richer, more interesting world. Low scorers on this scale
are more oriented to facts than fantasy.
– Artistic Interests. High scorers on this scale love beauty, both in art
and in nature. They become easily involved and absorbed in artistic
and natural events. They are not necessarily artistically trained or
talented, although many will be. The defining features of this scale
are interest in, and appreciation of, natural and artificial beauty. Low
scorers lack aesthetic sensitivity and interest in the arts.
– Emotionality. Persons high on Emotionality have good access to and
awareness of their own feelings. Low scorers are less aware of their
feelings and tend not to express their emotions openly.
– Adventurousness. High scorers on adventurousness are eager to try
new activities, travel to foreign lands, and experience different things.
They find familiarity and routine boring, and will take a new route
home just because it is different. Low scorers tend to feel uncomfort-
able with change and prefer familiar routines.
– Intellect. Intellect and artistic interests are the two most important,
central aspects of openness to experience. High scorers on Intellect love
to play with ideas. They are open-minded to new and unusual ideas,
and like to debate intellectual issues. They enjoy riddles, puzzles, and
brain teasers. Low scorers on Intellect prefer dealing with people or
things rather than ideas. They regard intellectual exercises as a waste
of time. Intellect should not be equated with intelligence. Intellect
is an intellectual style, not an intellectual ability, although high scor-
ers on Intellect score slightly higher than low-Intellect individuals on
standardized intelligence tests.
– Liberalism. Psychological liberalism refers to a readiness to challenge
authority, convention, and traditional values. In its most extreme
form, psychological liberalism can even represent outright hostility to-
ward rules, sympathy for law-breakers, and love of ambiguity, chaos,
and disorder. Psychological conservatives prefer the security and sta-
bility brought by conformity to tradition. Psychological liberalism and
conservatism are not identical to political affiliation, but certainly
incline individuals toward certain political parties.
It is possible, although unusual, to score high in one or more facets of a per-
sonality trait and low in other facets of the same trait. For example, you could
score highly in Imagination, Artistic Interests, Emotionality and Adventurous-
ness, but score low in Intellect and Liberalism.
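The trait and facet structure described in this section can be written down as a simple data structure. The sketch below is illustrative only: the facet names are those listed above, while the 0 to 100 scoring scale, the example scores and the idea of summarizing a trait by the mean of its facets are our own assumptions and not part of any standard instrument.

```python
from statistics import mean

# The five traits and their six facets, as listed in this section.
BIG_FIVE_FACETS = {
    "Extraversion": ["Friendliness", "Gregariousness", "Assertiveness",
                     "Activity Level", "Excitement-Seeking", "Cheerfulness"],
    "Agreeableness": ["Trust", "Morality", "Altruism",
                      "Cooperation", "Modesty", "Sympathy"],
    "Conscientiousness": ["Self-Efficacy", "Orderliness", "Dutifulness",
                          "Achievement-Striving", "Self-Discipline",
                          "Cautiousness"],
    "Neuroticism": ["Anxiety", "Anger", "Depression",
                    "Self-Consciousness", "Immoderation", "Vulnerability"],
    "Openness to Experience": ["Imagination", "Artistic Interests",
                               "Emotionality", "Adventurousness",
                               "Intellect", "Liberalism"],
}


def trait_score(facet_scores, trait):
    """Summarize one trait as the mean of its facet scores (an assumption)."""
    return mean(facet_scores[facet] for facet in BIG_FIVE_FACETS[trait])


# Hypothetical scores on an assumed 0-100 scale: high on four Openness
# facets but low on Intellect and Liberalism, as in the example above.
example_scores = {
    "Imagination": 80, "Artistic Interests": 75, "Emotionality": 70,
    "Adventurousness": 85, "Intellect": 20, "Liberalism": 30,
}
print(trait_score(example_scores, "Openness to Experience"))
```

Keeping the facet scores separately, instead of storing only the trait-level summary, preserves exactly the kind of within-trait variation the paragraph above describes.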

Chapter 3
Research Questions
The main objective of this paper is to draw a user’s virtual behavior model by
analyzing his/her OSN presence and to recommend products to the user on the basis
of that behavior model. To reach our main goal, we need to consider a few
sub-objectives: collecting the user’s social network activities, analyzing the
user’s activity over a few days, categorizing the user’s activity according to the
Big Five factors, and recommending services or products to the user on the basis of
the user’s behavior model.
In order to fulfill our objectives, some research questions arise. The main
research question of this paper is: How can users of OSN be categorized according to
the Big Five factors from their behavior in OSN? The sub research questions are:
1. How do OSN (Online Social Networks) represent a user?
2. How can we analyze user behavior?
3. How can user behavior be categorized into the Big Five factors?

Chapter 4
Proposed Research Methodology
In this paper our aim is to relate a text corpus from a social
network to the psychological theory of personality. We will also try to
implement a recommendation system based on behavior analysis. Correlational and
exploratory methodologies are therefore used in this paper, where our concept is the
behavior indicator in Big Five modeling and the variables are Extraversion,
Neuroticism, Agreeableness, Openness and Conscientiousness.
• 4.1 Data Collection: In this research, a large volume of data (big data) is
collected in order to categorize the user’s behavior. The data is collected from an
OSN (Twitter). The data is stored in the OSN through the user’s activities, such as
posts by the user, posts by the user’s friends, liked pages, etc. The collected
data is public data, so there is no barrier to using it. At a time, a user’s
previous 30 days of data will be collected. Data will be collected directly by the
system from the OSN with full user authorization. After collection, the data will be
stored securely in the system database.
Twitter, a social networking site, is well suited to sentiment analysis because it holds a very large number of short messages created by its users [60], so we used Twitter to collect users' data. Using the Twitter REST API 1.1, we collected public tweets and retweets. Our Twitter app requires users to authorize it before it extracts data from their profiles, and it does not collect data if users do not allow it to run. We made sure that all data we extract from Twitter is public. By calling the GET statuses/user_timeline and GET statuses/retweets_of_me methods we can collect the user's tweets and retweets. The system can also collect public data from the profiles that the user currently follows by using the GET friends/ids method. The collected data is in JSON format, and our Twitter app can write it to text files. As separate files are easier to use, we keep one data file per user, named by the user's unique identifier (user ID or username).
Figure 4.1. Modeling User Behavior (diagram labels: User, OSN (Twitter), Twitter API, LIWC, Mapping, Represents)
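The storage step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the tweets are assumed to be already downloaded as JSON objects in the Twitter REST API 1.1 layout (fields created_at, text and user.id_str), and the function names recent_tweets and write_user_files are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Timestamp layout of the "created_at" field in Twitter REST API 1.1 tweets,
# for example "Fri Jun 20 10:00:00 +0000 2014".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def recent_tweets(tweets, now, days=30):
    """Keep only the tweets created within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    kept = []
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        if cutoff <= created <= now:
            kept.append(tweet)
    return kept


def write_user_files(tweets, out_dir):
    """Append each tweet's text to a text file named after its author's user id."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = set()
    for tweet in tweets:
        path = out_dir / f"{tweet['user']['id_str']}.txt"
        with path.open("a", encoding="utf-8") as handle:
            # One tweet per line, so the file can be passed to a text analyzer.
            handle.write(tweet["text"].replace("\n", " ") + "\n")
        paths.add(path)
    return sorted(paths)
```

Calling recent_tweets with a timezone-aware `now` and then write_user_files on the result yields one text file per user that contains only the last 30 days of tweets.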
• 4.2 Data Analysis: The text file that contains the past data of a single user is analyzed with LIWC (Linguistic Inquiry and Word Count), a text analysis software program designed by James W. Pennebaker, Roger J. Booth and Martha E. Francis. Each text file analyzed by LIWC2007 can be treated as a whole or broken into segments. LIWC counts the words of the text according to its dictionary and then writes the results to a specified output file, with each value placed under its corresponding category. These categories indicate different aspects of the Big Five factors, and the modeling is built on these results. The table below shows which category belongs to which factor.
Table 4.1. Relationship between LIWC categories and Big Five factors
Big Five factors          LIWC categories
Extraversion              Social process, Family, Friends, Humans, Affective, Biological process, Sexual, Achievement
Openness to Experience    Leisure, Insight, Body, Ingestion
Neuroticism               Swear words, Negation, Negative emotion, Anger, Sadness, Sexual
Conscientiousness         Relativity, Motion, Space, Time, Religion, Death, Money, Certainty
Agreeableness             Positive emotion, Feel, Discrepancy (would), Tentative (maybe), Hear
The collected data is analyzed by LIWC, which splits every sentence into words. According to the meaning and use of the words, each LIWC category receives a percentage score. The percentages of the categories that belong to the same Big Five factor are then summed, and the factor with the highest total is taken as the user's behavior.
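The mapping from LIWC categories to Big Five factors can be sketched as follows. The dictionary restates Table 4.1; the category labels are written as in the table, and the column names in an actual LIWC output file may differ, so the naming is an assumption. The functions factor_scores and dominant_factor are hypothetical helpers for illustration.

```python
# Table 4.1 encoded as a mapping from Big Five factor to LIWC category labels.
# Labels follow the table; real LIWC output column names may differ (assumption).
FACTOR_CATEGORIES = {
    "Extraversion": ["Social process", "Family", "Friends", "Humans",
                     "Affective", "Biological process", "Sexual", "Achievement"],
    "Openness to Experience": ["Leisure", "Insight", "Body", "Ingestion"],
    "Neuroticism": ["Swear words", "Negation", "Negative emotion", "Anger",
                    "Sadness", "Sexual"],
    "Conscientiousness": ["Relativity", "Motion", "Space", "Time", "Religion",
                          "Death", "Money", "Certainty"],
    "Agreeableness": ["Positive emotion", "Feel", "Discrepancy", "Tentative",
                      "Hear"],
}


def factor_scores(category_percentages):
    """Sum the category percentages that belong to each Big Five factor."""
    return {
        factor: sum(category_percentages.get(name, 0.0) for name in categories)
        for factor, categories in FACTOR_CATEGORIES.items()
    }


def dominant_factor(category_percentages):
    """Return the factor with the highest summed percentage."""
    scores = factor_scores(category_percentages)
    return max(scores, key=scores.get)
```

Note that "Sexual" appears under both Extraversion and Neuroticism in Table 4.1, so its percentage contributes to both totals in this sketch.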
• 4.3 Results: LIWC reports the number of words counted in each category as a percentage of the total word count:
result = (TC * 100) / WC
where WC is the total number of words in the text file and TC is the number of words in the category.
The inverse is used to recover the exact number of words:
TC = (result * WC) / 100
The values of the categories that lie in the same Big Five factor are then summed with a linear formula (an unweighted sum):
f(X) = X1 + X2 + X3 + ... + Xi
We then express the value of each factor as a percentage, using the percentage formula
part / whole = % / 100
These results are used to draw the pie chart using Excel.
Example:
Figure 4.2. Pie Chart of LIWC Results
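The arithmetic behind the results can be illustrated with a short Python sketch. It assumes the LIWC percentage is TC * 100 / WC, so that the word count is recovered by inverting it as TC = result * WC / 100; the function names are hypothetical, and the numbers in the usage note are made up for illustration only.

```python
def liwc_percentage(category_count, word_count):
    """Percentage of the words in a text that fall into one category."""
    return category_count * 100 / word_count


def category_count(percentage, word_count):
    """Inverse of liwc_percentage: recover the number of words in a category."""
    return percentage * word_count / 100


def factor_shares(factor_totals):
    """Express each factor total as a share of the whole (part/whole = %/100)."""
    whole = sum(factor_totals.values())
    if whole == 0:
        return {factor: 0.0 for factor in factor_totals}
    return {factor: total * 100 / whole for factor, total in factor_totals.items()}
```

For instance, 12 category words in a 200-word file give liwc_percentage(12, 200) = 6.0, and category_count(6.0, 200) returns the original 12.0. The shares returned by factor_shares sum to 100 and are the values a pie chart would display.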
Figure 4.3. Personality Based Recommendation System
• 4.4 Recommendation Analysis: Depending on the behavior analysis, product brands are recommended to users. The dominant factor in a user's behavior can influence the user to like a particular type of product. Table 4.2 gives examples in which the majority of people with a particular behavior are interested in a particular brand, product or service.
For example, suppose users A, B and C follow the Age of Empires game page on Twitter. After analyzing their tweets and retweets, the system maps their behavior and finds that the major part of it is extraverted. If, after analyzing the tweets and retweets of user X, the system finds that the majority of his behavior is also influenced by extraversion, then we can recommend games like Age of Empires to him.
Table 4.2. Products under Big Five factors
Big Five factors          Product categories/brands

Video Games
Extraversion              Strategy (Age of Empires, Commandos)
Openness to Experience    Racing (Need for Speed)
Neuroticism               Shooting (Call of Duty, Counter Strike)
Conscientiousness         Chess, Sudoku
Agreeableness             Sports (FIFA)

Movies
Extraversion              Political, Fantasy, Family
Openness to Experience    Comedy, Sports, Drama
Neuroticism               Crime scene, Action, Horror
Conscientiousness         Political, Historical, Conspiracy
Agreeableness             Romantic, Drama

Music
Extraversion              Rock
Openness to Experience    Classical, Vocal, Country wood
Neuroticism               Pop, Heavy Metal
Conscientiousness         New Releases, Historic
Agreeableness             Romantic, Country

Food
Extraversion              Bead, Meat
Openness to Experience    Multicultural Food, Pizza
Neuroticism               Fast Food
Conscientiousness         Salad, Vegetable
Agreeableness             Bread, Cheese

Beverage
Extraversion              Coffee, Tea
Openness to Experience    Milkshake, Green Tea
Neuroticism               Soft Drinks
Conscientiousness         Green Tea, Black Coffee
Agreeableness             Coffee, Tea, Soft Drinks

Sports
Extraversion              Football, Athletics
Openness to Experience    Cricket, Swimming
Neuroticism               Boxing, Rugby, Martial arts
Conscientiousness         Athletics, Martial arts
Agreeableness             Gymnastics
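The recommendation step can be sketched as a simple lookup. This is a minimal illustration, assuming each user has already been assigned a dominant Big Five factor; the PRODUCTS dictionary is an excerpt of Table 4.2 (video games and sports only), and the function names majority_factor and recommend are hypothetical.

```python
from collections import Counter

# Excerpt of Table 4.2 (video games and sports rows only).
PRODUCTS = {
    "Video Games": {
        "Extraversion": ["Age of Empires", "Commandos"],
        "Openness to Experience": ["Need for Speed"],
        "Neuroticism": ["Call of Duty", "Counter Strike"],
        "Conscientiousness": ["Chess", "Sudoku"],
        "Agreeableness": ["FIFA"],
    },
    "Sports": {
        "Extraversion": ["Football", "Athletics"],
        "Openness to Experience": ["Cricket", "Swimming"],
        "Neuroticism": ["Boxing", "Rugby", "Martial arts"],
        "Conscientiousness": ["Athletics", "Martial arts"],
        "Agreeableness": ["Gymnastics"],
    },
}


def majority_factor(follower_factors):
    """Most common dominant factor among the followers of a page."""
    return Counter(follower_factors).most_common(1)[0][0]


def recommend(user_factor, product_type):
    """Products listed for the user's dominant factor, or an empty list."""
    return PRODUCTS.get(product_type, {}).get(user_factor, [])
```

In the Age of Empires example, majority_factor applied to the factors of followers A, B and C gives the page's typical factor; if user X has the same dominant factor, recommend(user_factor, "Video Games") returns the matching games.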

Chapter 5
Conclusions
In this thesis we showed that personality recognition can be automated by analyzing language cues. Little work has been done in this field, and to the best of our knowledge our research is among the first to examine the recognition of personality and to introduce a recommendation system based on the results of sentiment analysis. During our research we realized that feature selection is one of the most important tasks, as some of the best models contain only a small subset of the full feature set.
LIWC features are beneficial for all traits. For all recognition tasks we analyzed the influence of the most relevant individual features in specific models. We also used the Stanford NLP (natural language processing) application to analyze and split the texts. Later we used only LIWC, because it generated more accurate results than Stanford NLP for our data analysis.
At present our system can use only text information; in future it should also be able to analyze data from shared links or videos. The system cannot identify quotations (which users use to share others' speech), and it lacks the ability to understand double negatives in a sentence, for example: “The service of Samsung Galaxy S3 is not very bad”.
There is considerable scope for analyzing exclamatory sentences and smileys (sentimental expressions). Our system cannot understand sarcasm at present. The accuracy of brand recommendations depends on the percentages of the Big Five factors, so future work should make the depth of measurement and the scale of marking more efficient.

Bibliography
[1] K. Cherry, “The big five personality dimensions,” 2012. Accessed: 2010-09-30.
[2] “Facebook.com.” Accessed: 2014-06-01.
[3] “Twitter.com.” Accessed: 2014-06-01.
[4] J. Bao, Y. Zheng, and M. F. Mokbel, “Location-based and preference-aware recommendation using sparse geo-social networking data,” in Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pp. 199–208, ACM, 2012.
[5] A. M. Ferman, J. H. Errico, P. v. Beek, and M. I. Sezan, “Content-based filtering and personalization using structured metadata,” in Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pp. 393–393, ACM, 2002.
[6] “Amazon.com.” Accessed: 2014-04-01.
[7] “Netflix.com.” Accessed: 2014-04-01.
[8] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, “Characterizing user behavior in online social networks,” in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pp. 49–62, ACM, 2009.
[9] N. O. Report, “Social networks and blogs now 4th most popular online activity.”
[10] Y. Zheng, “Location-based social networks: Users,” in Computing with Spatial Trajectories, pp. 243–276, Springer, 2011.
[11] “Flickr.com.” Accessed: 2014-04-01.
[12] “Foursquare.com.” Accessed: 2014-01-01.
[13] X. Cao, G. Cong, and C. S. Jensen, “Mining significant semantic locations from GPS data,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1009–1020, 2010.
[14] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma, “Mining interesting locations and travel sequences from GPS trajectories,” in Proceedings of the 18th international conference on World wide web, pp. 791–800, ACM, 2009.
[15] Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.-Y. Ma, “Mining user similarity based on location history,” in Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems, p. 34, ACM, 2008.
[16] X. Xiao, Y. Zheng, Q. Luo, and X. Xie, “Finding similar users using category-based location history,” in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 442–445, ACM, 2010.
[17] W. Liu, Y. Zheng, S. Chawla, J. Yuan, and X. Xing, “Discovering spatio-temporal causal interactions in traffic data streams,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1010–1018, ACM, 2011.
[18] Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma, “Understanding mobility based on GPS data,” in Proceedings of the 10th international conference on Ubiquitous computing, pp. 312–321, ACM, 2008.
[19] L. Wang, Y. Zheng, X. Xie, and W.-Y. Ma, “A flexible spatio-temporal indexing scheme for large-scale GPS track retrieval,” in Mobile Data Management, 2008. MDM’08. 9th International Conference on, pp. 1–8, IEEE, 2008.
[20] I. Konstas, V. Stathopoulos, and J. M. Jose, “On social networks and collaborative recommendation,” in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 195–202, ACM, 2009.
[21] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, “An algorithmic framework for performing collaborative filtering,” in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 230–237, ACM, 1999.
[22] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions,” Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 6, pp. 734–749, 2005.
[23] H. Yildirim and M. S. Krishnamoorthy, “A random walk method for alleviating the sparsity problem in collaborative filtering,” in Proceedings of the 2008 ACM conference on Recommender systems, pp. 131–138, ACM, 2008.
[24] G. Das, N. Koudas, M. Papagelis, and S. Puttaswamy, “Efficient sampling of information in social networks,” in Proceedings of the 2008 ACM workshop on Search in social media, pp. 67–74, ACM, 2008.
[25] H. Halpin, V. Robu, and H. Shepherd, “The complex dynamics of collaborative tagging,” in Proceedings of the 16th international conference on World Wide Web, pp. 211–220, ACM, 2007.
[26] S. B. Subramanya and H. Liu, “Socialtagger-collaborative tagging for blogs in the long tail,” in Proceedings of the 2008 ACM workshop on Search in social media, pp. 19–26, ACM, 2008.
[27] M. Strohmaier, “Purpose tagging: capturing user intent to assist goal-oriented social search,” in Proceedings of the 2008 ACM workshop on Search in social media, pp. 35–42, ACM, 2008.
[28] N. Craswell and M. Szummer, “Random walks on the click graph,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 239–246, ACM, 2007.
[29] M. Clements, A. P. de Vries, and M. J. Reinders, “Optimizing single term queries using a personalized Markov random walk over the social graph,” in Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), 2008.
[30] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme, Information retrieval in folksonomies: Search and ranking. Springer, 2006.
[31] G. Paltoglou, S. Gobron, M. Skowron, M. Thelwall, and D. Thalmann, “Sentiment analysis of informal textual communication in cyberspace,” Proc. Engage, pp. 13–25, 2010.
[32] “Avatarmovie.com.” Accessed: 2014-04-01.
[33] A. Kappas, U. Hess, and K. R. Scherer, “6. voice and emotion,” Fundamentals of nonverbal behavior, p. 200, 1991.
[34] P. Becheiraz and D. Thalmann, “A model of nonverbal communication and interpersonal relationship between virtual actors,” in Computer Animation’96. Proceedings, pp. 58–67, IEEE, 1996.
[35] S. Gobron, J. Ahn, G. Paltoglou, M. Thelwall, and D. Thalmann, “From sentence to emotion: a real-time three-dimensional graphics metaphor of emotions extracted from text,” The Visual Computer, vol. 26, no. 6-8, pp. 505–519, 2010.
[36] M. Skowron, “Affect listeners: Acquisition of affective states by means of conversational systems,” in Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 169–181, Springer, 2010.
[37] M. Thelwall and D. Wilkinson, “Public dialogs in social network sites: What is their purpose?,” Journal of the American Society for Information Science and Technology, vol. 61, no. 2, pp. 392–404, 2010.
[38] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pp. 79–86, Association for Computational Linguistics, 2002.
[39] M. Thomas, B. Pang, and L. Lee, “Get out the vote: Determining support or opposition from congressional floor-debate transcripts,” in Proceedings of the 2006 conference on empirical methods in natural language processing,
[40] I. Ounis, C. Macdonald, and I. Soboroff, “Overview of the TREC-2008 blog track,” tech. rep., DTIC Document, 2008.
[41] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.
[42] T. Mullen and N. Collier, “Sentiment analysis using support vector machines with diverse information sources.,” in EMNLP, vol. 4, pp. 412–418, 2004.
[43] C. Whitelaw, N. Garg, and S. Argamon, “Using appraisal groups for sentiment analysis,” in Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 625–631, ACM, 2005.
[44] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of the conference on human language technology and empirical methods in natural language processing,
[45] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry and word count: LIWC 2001,” Mahwah: Lawrence Erlbaum Associates, vol. 71, p. 2001, 2001.
[46] M. Bradley and P. Lang, “Affective norms for English words (ANEW): Technical manual and affective ratings,” Gainesville, FL: The Center for Research in Psychophysiology, University of Florida, 1999.
[47] J. Brooke, M. Tofiloski, and M. Taboada, “Cross-linguistic sentiment analysis: From English to Spanish.,” in RANLP, pp. 50–54, 2009.
[48] R. B. Slatcher, C. K. Chung, J. W. Pennebaker, and L. D. Stone, “Winning words: Individual differences in linguistic style among US presidential and vice presidential candidates,” Journal of Research in Personality, vol. 41, no. 1, pp. 63–75, 2007.
[49] K. M. Colby, S. Weber, and F. D. Hilf, “Artificial paranoia,” Artificial Intelligence, vol. 2, no. 1, pp. 1–25, 1971.
[50] F. Barthelemy, B. Dosquet, S. Gries, and X. Magnant, “Believable synthetic characters in a virtual emarket,” in Artificial Intelligence and Applications: IASTED International Conference Proceedings, as part of the 22nd IASTED International Multi-Conference on Applied Informatics, 2004.
[51] J. Bates et al., “The role of emotion in believable agents,” Communications of the ACM, vol. 37, no. 7, pp. 122–125, 1994.
[52] J. C. Acosta, “Using emotion to gain rapport in a spoken dialog system,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pp. 49–54, Association for Computational Linguistics, 2009.
[53] J. Gratch, N. Wang, J. Gerten, E. Fast, and R. Duffy, “Creating rapport with virtual agents,” in Intelligent Virtual Agents, pp. 125–138, Springer, 2007.
[54] P. Turney and M. L. Littman, “Unsupervised learning of semantic orientation from a hundred-billion-word corpus,” 2002.
[55] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone, “Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents,” in Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pp. 413–420, ACM, 1994.
[56] C. Pelachaud, “Studies on gesture expressivity for a virtual agent,” Speech Communication, vol. 51, no. 7, pp. 630–639, 2009.
[57] J. C. Ward and A. L. Ostrom, “The internet as information minefield: an analysis of the source and content of brand information yielded by net searches,” Journal of Business research, vol. 56, no. 11, pp. 907–914, 2003.
[58] S. Bai, T. Zhu, and L. Cheng, “Big-five personality prediction based on user behaviors at social network sites,” arXiv preprint arXiv:1204.4809, 2012.
[59] M. Smith, V. Barash, L. Getoor, and H. W. Lauw, “Leveraging social context for searching social media,” in Proceedings of the 2008 ACM workshop on Search in social media, pp. 91–94, ACM, 2008.
[60] A. Pak and P. Paroubek, “Twitter as a corpus for sentiment analysis and opinion mining.,” in LREC, 2010.