1. predicting the success of altruistic requests
Sentiment analysis and machine learning approach
Authors: Emanuele Pesce, Jacek Filipczuk
Supervisor: Prof. Sabrina Senatore
April 2015
University of Salerno, Department of Computer Science
2. outline
Introduction
Sentiment analysis
The problem: Random Acts of Pizza
Machine learning and sentiment extraction
Machine learning approach
Dataset and features
Sentiment extraction
Sentiment compression
Success frequency rate
Classification models
Results
Conclusions and future work
4. sentiment analysis: what is it?
What is sentiment analysis (also known as opinion mining)?
∙ The task of identifying positive, negative and neutral opinions and
emotions expressed in natural language
∙ It uses techniques like natural language processing, text analysis,
statistics, machine learning and others
5. sentiment analysis: polarity
What is it?
∙ Given a text, discover how people feel when reading it
∙ Determine whether the text expresses emotional states such as "angry" or "happy"
∙ So, the polarity of a text can be:
∙ positive
∙ negative
∙ neutral
An example
∙ I love this movie, but I hate the director
∙ The sentence above is composed of:
∙ I love this movie, which has a positive polarity score
∙ I hate the director, which has a negative polarity score
∙ So the sentence, taken as a whole, has a neutral polarity
6. sentiment analysis: domains
Sentiment analysis is often used for:
∙ Social media monitoring
∙ Voice of the customer, to track customer reviews
∙ Survey response
∙ Business analytics
∙ Every situation in which text needs to be analyzed
7. predicting altruism through free pizza: the competition
∙ Predicting altruism through free pizza is a challenge launched by
Tim Althoff et al. on Kaggle
∙ Kaggle is a website that hosts competitions about machine
learning and computer science in general
∙ The competition is based on Random Acts of Pizza
Random Acts of Pizza: what is it?
∙ It is a Reddit community where users can make requests for
free pizza
∙ For example: "I'll write a poem, sing a song, do a dance, play an
instrument, whatever! I just want a pizza"
∙ If someone buys the requester a pizza, the request is considered
successful; otherwise it is unsuccessful.
8. predicting altruism through free pizza: inputs and goals
Input
∙ The competition provides a dataset of textual requests for pizza
from the Reddit community Random Acts of Pizza
∙ Each sample of the dataset carries information concerning both
the request and the requester
Goal
∙ Given a post (or request), the goal is to predict whether it will be
successful or unsuccessful
10. machine learning approach
∙ We decided to adopt a machine learning approach to face
the challenge
∙ Figure 1 shows the workflow describing the phases of this work
Figure 1: workflow
11. dataset and features description
∙ The dataset contains 5671 textual requests for pizza
∙ Each sample of the dataset carries several pieces of information:
∙ about the text and the title of the request
∙ about the post of the request (number of comments, number of likes,
etc.)
∙ about the user who made the request (age, publication date, etc.)
∙ a field saying whether the request was satisfied (pizza bought) or not
(so we used supervised learning algorithms)
∙ The dataset was in JSON format; we used Python to extract the
information.
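The extraction step above can be sketched in Python. The upvote/downvote field names appear on the next slide; `request_title`, `request_text` and `requester_received_pizza` are assumed names for the title, text and success label, used here only for illustration.

```python
import json

def load_requests(json_str):
    """Parse the JSON dump and keep the fields used later as features
    and as the supervised-learning label."""
    samples = []
    for rec in json.loads(json_str):
        samples.append({
            "title": rec["request_title"],            # assumed field name
            "text": rec["request_text"],              # assumed field name
            "upvotes": rec["number_of_upvotes_of_request_at_retrieval"],
            "downvotes": rec["number_of_downvotes_of_request_at_retrieval"],
            "label": rec["requester_received_pizza"],  # assumed field name
        })
    return samples

# One toy record mimicking the dataset's structure
raw = json.dumps([{
    "request_title": "I just want a pizza",
    "request_text": "I'll write a poem, sing a song, whatever!",
    "number_of_upvotes_of_request_at_retrieval": 4,
    "number_of_downvotes_of_request_at_retrieval": 1,
    "requester_received_pizza": True,
}])
samples = load_requests(raw)
```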
12. dataset: features about the post
∙ ”number_of_downvotes_of_request_at_retrieval”
∙ ”number_of_upvotes_of_request_at_retrieval”
∙ ”request_number_of_comments_at_retrieval”
∙ ”unix_timestamp_of_request_utc”
14. extracting information from title and text of requests
Textual features
∙ For each request the most important fields are textual: the title and
the text
∙ The features in the previous slides were almost all in numeric
format
∙ They can be used for computation after a simple
preprocessing phase
∙ A different story for the textual features…
Goal
Convert the textual features into numeric features that carry
sentiment information and are suitable as input to a machine
learning algorithm
15. sentiment extraction from text
Textual features
∙ Text of the request
∙ Title of the request
To convert the text into computable features, we calculate two
measures:
∙ Sentiment compression: it concerns the sentiment of the text
∙ Success frequency rate: it concerns the success rate of the text
16. sentiment compression: nltk polarity
We used NLTK's API to get the polarity of a text
What NLTK returns
∙ Given a text, NLTK returns three polarity values: positivity,
negativity, neutrality
∙ If the value of the neutral sentiment is greater than 0.5, the text is
labelled as neutral
∙ Otherwise it is labelled with the greater of positivity and negativity,
whose values are correlated (their sum must be 1)
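The labelling rule above can be sketched as follows, assuming the three polarity scores have already been obtained:

```python
def polarity_label(p_pos, p_neg, p_neu):
    """Label a text from its three polarity scores: neutral wins if its
    score exceeds 0.5; otherwise the larger of positivity and
    negativity decides (the two sum to 1)."""
    if p_neu > 0.5:
        return "neutral"
    return "positive" if p_pos >= p_neg else "negative"
```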
17. sentiment compression: sclabel
∙ We compressed the three values returned by NLTK into a single
value
∙ Let pPos and pNeu be the NLTK values associated, respectively, with
the positive and neutral sentiment
SClabel = pPos · sign(0.5 − pNeu) (1)
∙ where the sign function is defined as:
sign(x) = −1 if x ≤ 0, 1 if x > 0
∙ A single value preserves the information about both the positivity
and the neutrality
18. sentiment compression: an example
An SClabel of −0.7 means that:
∙ the text is neutral, because the sign is negative
∙ its positivity is 0.7 (so its negativity is 0.3)
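Equation (1) and the example above can be sketched as a small Python function; `p_pos` and `p_neu` are the positivity and neutrality scores from the previous slides:

```python
def sc_label(p_pos, p_neu):
    """Compress the polarity scores into one signed value (Eq. 1):
    the magnitude is the positivity, and a negative sign marks a
    neutral text, since sign(x) = -1 for x <= 0 and 1 for x > 0."""
    sign = -1.0 if 0.5 - p_neu <= 0 else 1.0
    return p_pos * sign
```

For instance, `sc_label(0.7, 0.8)` reproduces the −0.7 of the example slide.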
19. success frequency rate
∙ We extract a new feature to estimate the success rate of a post
∙ We built a bag of words containing the most frequent words
appearing in successful requests
∙ For each word we keep track of how many times it appeared
∙ So we extract the success frequency rate of a text in this way:
succFrequency = sum(frequencyWordInText · frequencyWordInBag) / lengthText (2)
20. success frequency rate: an example
Given the text "home sweet home", the success frequency rate is
calculated as:
succFrequency = (2 · frequencyWordInBag(home) + 1 · frequencyWordInBag(sweet)) / 3
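Equation (2) and the "home sweet home" example can be sketched as follows; the bag-of-words counts are toy values, not the real counts from the dataset:

```python
from collections import Counter

def succ_frequency(text, bag):
    """Success frequency rate (Eq. 2): weight each word's count in the
    text by its count in the bag of words built from successful
    requests, then normalise by the text length."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(n * bag.get(w, 0) for w, n in counts.items())
    return total / len(words)

# Toy bag: 'home' appeared 5 times in successful requests, 'sweet' twice
bag = {"home": 5, "sweet": 2}
rate = succ_frequency("home sweet home", bag)  # (2*5 + 1*2) / 3
```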
21. data matrix composition
Data matrix
So we obtained a [5671 x 25] matrix, where rows represent
samples (or requests) and columns represent features
Features selected
∙ 4 about the post (described previously)
∙ 17 about the requester (described previously)
∙ 2 about the sentiment of requests (SClabel[title] and SClabel[text])
∙ 2 about the success frequency rate of requests
(SuccFrequency[title] and SuccFrequency[text])
24. preprocessing
Normalization
To standardize the range of the feature values we used the formula:
Xnew = (X − µ) / σ (3)
where X is a column of the data matrix, µ is its mean and σ its
standard deviation
Outliers
∙ We consider as outliers values that differ from the mean by more
than 5 standard deviations
∙ We removed those values
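Equation (3) and the 5-sigma outlier rule can be sketched per feature column; the standard library's `statistics` module stands in here for whichever numerical stack was actually used:

```python
from statistics import mean, stdev

def zscore(column):
    """Standardize one feature column (Eq. 3): subtract the mean,
    divide by the standard deviation."""
    mu, sigma = mean(column), stdev(column)
    return [(x - mu) / sigma for x in column]

def drop_outliers(column, k=5.0):
    """Keep only the values within k standard deviations of the mean
    (the slide uses k = 5)."""
    mu, sigma = mean(column), stdev(column)
    return [x for x in column if abs(x - mu) <= k * sigma]
```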
25. training set and test set
Data matrix after preprocessing
We have obtained a matrix [5548 x 25], where rows represent
samples (or requests) and columns represent features
Training and test set
We divided the data by random sampling without repetition as
follows:
∙ training set [3884 x 25] ≈ 70% of the data
∙ test set [1664 x 25] ≈ 30% of the data
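The 70/30 split by random sampling without repetition can be sketched as a shuffle-and-cut; the fixed seed is an assumption added only for reproducibility:

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices once (sampling without repetition),
    then cut them at the requested fraction."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = round(len(samples) * train_fraction)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```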
26. classification models
After obtaining the features, we trained several classification
models on the training set:
∙ Support vector machine
∙ Linear Kernel
∙ Gaussian Kernel
∙ Polynomial Kernel
∙ Spline Kernel
∙ Random forest
∙ k-nearest neighbors
∙ k values used = 1, 5, 15, 25, 51
∙ Naive Bayes
We tested each model on the test set in order to evaluate its
performance
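The slides do not say which library was used to train the models; the evaluation itself, though, reduces to the three metrics reported in the results. A minimal sketch, with 1 marking a successful request:

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for the binary task, where 1
    means the request succeeded (pizza bought)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```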
28. results classifiers
Figure 4: Accuracy, precision and recall for each classifier. SVM (linear
kernel) and Random Forest gave the best performance.
29. accuracy
Figure 5: Accuracy of classifiers. Best performances were obtained from
Random Forest and SVM (linear kernel)
30. precision
Figure 6: Precision of classifiers. Best performances were obtained from
Random Forest and SVM (linear kernel)
31. recall
Figure 7: Recall of classifiers. The best performance was obtained from SVM
(linear kernel), followed by Random Forest
33. conclusions and future work
Performance
∙ Globally we can say that SVM and Random Forest are the best
models for this dataset
∙ The best performance was obtained from Random Forest:
∙ Accuracy ≈ 0.86
∙ Precision ≈ 0.83
∙ Recall ≈ 0.50
Future work
∙ Try to make the classes more separable, for example by reducing
noise in the feature space
∙ Also consider synonyms in the bag of words before calculating the
success frequency rate