1. predicting the success of altruistic requests
Sentiment analysis and machine learning approach
Authors: Emanuele Pesce, Jacek Filipczuk
Supervisor: Prof. Sabrina Senatore
April 2015
University of Salerno, Department of Computer Science
2. outline
Introduction
Sentiment analysis
The problem: Random Acts of Pizza
Machine learning and sentiment extraction
Machine learning approach
Dataset and features
Sentiment extraction
Sentiment compression
Success frequency rate
Classification models
Results
Conclusions and future work
4. sentiment analysis: what is it?
What is sentiment analysis (also known as opinion mining)?
∙ The task of identifying positive, negative and neutral opinions and
emotions expressed in natural language
∙ It uses techniques like natural language processing, text analysis,
statistics, machine learning and others
5. sentiment analysis: polarity
What is it?
∙ Given a text, discover how people feel when reading it
∙ Determine whether the text expresses emotional states such as "angry" or "happy"
∙ So, the polarity of a text can be:
∙ positive
∙ negative
∙ neutral
An example
∙ I love this movie, but I hate the director
∙ The sentence above is composed of:
∙ I love this movie, which has a positive polarity score
∙ I hate the director, which has a negative polarity score
∙ So the sentence, taken as a whole, has a neutral polarity
6. sentiment analysis: domains
Sentiment analysis is often used for:
∙ Social media monitoring
∙ Voice of the customer, to track customer reviews
∙ Survey response
∙ Business analytics
∙ Every situation in which text needs to be analyzed
7. predicting altruism through free pizza: the competition
∙ Predicting altruism through free pizza is a challenge launched by
Tim Althoff et al. on Kaggle
∙ Kaggle is a website that hosts competitions about machine
learning and computer science in general
∙ The competition is based on Random Acts of Pizza
Random Acts of Pizza: what is it?
∙ It is a Reddit community where users can make requests for
free pizza
∙ For example: "I'll write a poem, sing a song, do a dance, play an
instrument, whatever! I just want a pizza"
∙ If someone buys the requester a pizza, the request is considered
successful; otherwise it is unsuccessful.
8. predicting altruism through free pizza: inputs and goals
Input
∙ The competition provides a dataset of textual requests for pizza
from the Reddit community Random Acts of Pizza
∙ Each sample of the dataset carries information concerning both
the request and the requester
Goal
∙ Given a post (or request), the goal is to predict whether it will be
successful or unsuccessful
10. machine learning approach
∙ We decided to adopt a machine learning approach to face
the challenge
∙ Figure 1 shows the workflow describing the phases of this work
Figure 1: workflow
11. dataset and features description
∙ The dataset contains 5671 textual requests for pizza
∙ Each sample of the dataset carries several pieces of information:
∙ about the text and the title of the request
∙ about the post of the request (number of comments, number of likes,
etc.)
∙ about the user who made the request (age, publication date, etc.)
∙ a field saying whether the request was satisfied (pizza bought) or not
(so we used supervised learning algorithms)
∙ The dataset was in JSON format; we used Python to extract the
information.
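The extraction step above can be sketched in Python. The upvote/downvote field names appear on the next slide; `request_title`, `request_text` and `requester_received_pizza` are assumed names for the title, text and success label, used here only for illustration.

```python
import json

def load_requests(json_str):
    """Parse the JSON dump and keep the fields used later as features
    and as the supervised-learning label."""
    samples = []
    for rec in json.loads(json_str):
        samples.append({
            "title": rec["request_title"],            # assumed field name
            "text": rec["request_text"],              # assumed field name
            "upvotes": rec["number_of_upvotes_of_request_at_retrieval"],
            "downvotes": rec["number_of_downvotes_of_request_at_retrieval"],
            "label": rec["requester_received_pizza"],  # assumed field name
        })
    return samples

# One toy record mimicking the dataset's structure
raw = json.dumps([{
    "request_title": "I just want a pizza",
    "request_text": "I'll write a poem, sing a song, whatever!",
    "number_of_upvotes_of_request_at_retrieval": 4,
    "number_of_downvotes_of_request_at_retrieval": 1,
    "requester_received_pizza": True,
}])
samples = load_requests(raw)
```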
12. dataset: features about the post
∙ ”number_of_downvotes_of_request_at_retrieval”
∙ ”number_of_upvotes_of_request_at_retrieval”
∙ ”request_number_of_comments_at_retrieval”
∙ ”unix_timestamp_of_request_utc”
14. extracting information from title and text of requests
Textual features
∙ For each request the most important fields are textual: the title and
the text
∙ The features in the previous slides were almost all in numeric
format
∙ They can be used for computation after a simple
preprocessing phase
∙ A different story for the textual features…
Goal
Convert the textual features into numeric features that carry
sentiment information and are suitable as input to a machine
learning algorithm
15. sentiment extraction from text
Textual features
∙ Text of the request
∙ Title of the request
To convert the text into computable features, we calculate two
measures:
∙ Sentiment compression: it concerns the sentiment of the text
∙ Success frequency rate: it concerns the success rate of the text
16. sentiment compression: nltk polarity
We used NLTK's API to get the polarity of a text
What NLTK returns
∙ Given a text, NLTK returns three polarity values: positivity,
negativity, neutrality
∙ If the value of the neutral sentiment is greater than 0.5, the text is
labelled as neutral
∙ Otherwise it is labelled with the greater of positivity and negativity,
whose values are correlated (their sum must be 1)
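The labelling rule above can be sketched as follows, assuming the three polarity scores have already been obtained:

```python
def polarity_label(p_pos, p_neg, p_neu):
    """Label a text from its three polarity scores: neutral wins if its
    score exceeds 0.5; otherwise the larger of positivity and
    negativity decides (the two sum to 1)."""
    if p_neu > 0.5:
        return "neutral"
    return "positive" if p_pos >= p_neg else "negative"
```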
17. sentiment compression: sclabel
∙ We compressed the three values returned by NLTK into a single
value
∙ Let pPos and pNeu be the NLTK values associated, respectively, with
the positive and neutral sentiment
SClabel = pPos · sign(0.5 − pNeu) (1)
∙ where the sign function is defined as:
sign(x) = −1 if x ≤ 0, 1 if x > 0
∙ A single value preserves the information about both the positivity
and the neutrality
18. sentiment compression: an example
An SClabel of −0.7 means that:
∙ the text is neutral, because the sign is negative
∙ its positivity is 0.7 (so its negativity is 0.3)
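Equation (1) and the example above can be sketched as a small Python function; `p_pos` and `p_neu` are the positivity and neutrality scores from the previous slides:

```python
def sc_label(p_pos, p_neu):
    """Compress the polarity scores into one signed value (Eq. 1):
    the magnitude is the positivity, and a negative sign marks a
    neutral text, since sign(x) = -1 for x <= 0 and 1 for x > 0."""
    sign = -1.0 if 0.5 - p_neu <= 0 else 1.0
    return p_pos * sign
```

For instance, `sc_label(0.7, 0.8)` reproduces the −0.7 of the example slide.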
19. success frequency rate
∙ We extract a new feature to estimate the success rate of a post
∙ We built a bag of words containing the most frequent words
appearing in successful requests
∙ For each word we keep track of how many times it appeared
∙ So we extract the success frequency rate of a text in this way:
succFrequency = sum(frequencyWordInText · frequencyWordInBag) / lengthText (2)
20. success frequency rate: an example
Given the text "home sweet home", the success frequency rate is
calculated as:
succFrequency = (2 · frequencyWordInBag(home) + 1 · frequencyWordInBag(sweet)) / 3
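Equation (2) and the "home sweet home" example can be sketched as follows; the bag-of-words counts are toy values, not the real counts from the dataset:

```python
from collections import Counter

def succ_frequency(text, bag):
    """Success frequency rate (Eq. 2): weight each word's count in the
    text by its count in the bag of words built from successful
    requests, then normalise by the text length."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(n * bag.get(w, 0) for w, n in counts.items())
    return total / len(words)

# Toy bag: 'home' appeared 5 times in successful requests, 'sweet' twice
bag = {"home": 5, "sweet": 2}
rate = succ_frequency("home sweet home", bag)  # (2*5 + 1*2) / 3
```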
21. data matrix composition
Data matrix
So we obtained a [5671 x 25] matrix, where rows represent
samples (or requests) and columns represent features
Features selected
∙ 4 about the post (described previously)
∙ 17 about the requester (described previously)
∙ 2 about the sentiment of requests (SClabel[title] and SClabel[text])
∙ 2 about the success frequency rate of requests
(SuccFrequency[title] and SuccFrequency[text])
24. preprocessing
Normalization
To standardize the range of the feature values we used the formula:
Xnew = (X − µ) / σ (3)
where X is a column of the data matrix, µ is its mean and σ its
standard deviation
Outliers
∙ We consider as outliers values that differ from the mean by more
than 5 standard deviations
∙ We removed those values
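Equation (3) and the 5-sigma outlier rule can be sketched per feature column; the standard library's `statistics` module stands in here for whichever numerical stack was actually used:

```python
from statistics import mean, stdev

def zscore(column):
    """Standardize one feature column (Eq. 3): subtract the mean,
    divide by the standard deviation."""
    mu, sigma = mean(column), stdev(column)
    return [(x - mu) / sigma for x in column]

def drop_outliers(column, k=5.0):
    """Keep only the values within k standard deviations of the mean
    (the slide uses k = 5)."""
    mu, sigma = mean(column), stdev(column)
    return [x for x in column if abs(x - mu) <= k * sigma]
```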
25. training set and test set
Data matrix after preprocessing
We have obtained a matrix [5548 x 25], where rows represent
samples (or requests) and columns represent features
Training and test set
We divided the data by random sampling without repetition as
follows:
∙ training set [3884 x 25] ≈ 70% of the data
∙ test set [1664 x 25] ≈ 30% of the data
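The 70/30 split by random sampling without repetition can be sketched as a shuffle-and-cut; the fixed seed is an assumption added only for reproducibility:

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices once (sampling without repetition),
    then cut them at the requested fraction."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = round(len(samples) * train_fraction)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```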
26. classification models
After obtaining the features, we trained several classification
models on the training set:
∙ Support vector machine
∙ Linear Kernel
∙ Gaussian Kernel
∙ Polynomial Kernel
∙ Spline Kernel
∙ Random forest
∙ k-nearest neighbors
∙ k values used = 1, 5, 15, 25, 51
∙ Naive Bayes
We tested each model on the test set in order to evaluate its
performance
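The slides do not say which library was used to train the models; the evaluation itself, though, reduces to the three metrics reported in the results. A minimal sketch, with 1 marking a successful request:

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for the binary task, where 1
    means the request succeeded (pizza bought)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```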
28. results classifiers
Figure 4: Accuracy, precision and recall for each classifier. SVM (linear
kernel) and Random Forest gave the best performance.
29. accuracy
Figure 5: Accuracy of classifiers. Best performances were obtained from
Random Forest and SVM (linear kernel)
30. precision
Figure 6: Precision of classifiers. Best performances were obtained from
Random Forest and SVM (linear kernel)
31. recall
Figure 7: Recall of classifiers. The best performance was obtained from SVM
(linear kernel), followed by Random Forest
33. conclusions and future work
Performance
∙ Globally we can say that SVM and Random Forest are the best
models for this dataset
∙ The best performance was obtained from Random Forest:
∙ Accuracy ≈ 0.86
∙ Precision ≈ 0.83
∙ Recall ≈ 0.50
Future work
∙ Try to make the classes more separable, for example by reducing
noise in the feature space
∙ Also consider synonyms in the bag of words before calculating the
success frequency rate