predicting the success of altruistic requests
Sentiment analysis and machine learning approach
Authors: Emanuele Pesce, Jacek Filipczuk
Supervisor: Prof. Sabrina Senatore
April 2015
University of Salerno, Department of Computer Science
outline
Introduction
Sentiment analysis
The problem: Random Acts of Pizza
Machine learning and sentiment extraction
Machine learning approach
Dataset and features
Sentiment extraction
Sentiment compression
Success frequency rate
Classification models
Results
Conclusions and future work
introduction
sentiment analysis: what is it?
What is sentiment analysis (also known as opinion mining)?
∙ The task of identifying positive, negative and neutral opinions and
emotions expressed in natural language
∙ It uses techniques like natural language processing, text analysis,
statistics, machine learning and others
sentiment analysis: polarity
What is it?
∙ Given a text, determine how people feel when reading it
∙ Determine whether the text conveys emotional states such as "angry" or "happy"
∙ So, the polarity of a text can be:
∙ positive
∙ negative
∙ neutral
An example
∙ I love this movie, but I hate the director
∙ The sentence above is composed of:
∙ I love this movie, which has a positive polarity score
∙ I hate the director, which has a negative polarity score
∙ So the sentence, taken as a whole, can be said to have neutral polarity
sentiment analysis: domains
Often Sentiment Analysis is used in:
∙ Social media monitoring
∙ Voice of the customer, to track customer reviews
∙ Survey responses
∙ Business analytics
∙ Any situation in which text needs to be analyzed
predicting altruism through free pizza: the competition
∙ Predicting altruism through free pizza is a challenge launched by
Tim Althoff et al. on Kaggle
∙ Kaggle is a website which hosts competitions on machine
learning and computer science in general
∙ The competition is based on Random Acts of Pizza
Random Acts of Pizza: what is it?
∙ It is a Reddit community where users can make requests for
free pizza
∙ For example: "I’ll write a poem, sing a song, do a dance, play an
instrument, whatever! I just want a pizza"
∙ If someone buys a pizza for the requester, the request is
considered successful; otherwise it is unsuccessful.
predicting altruism through free pizza: inputs and goals
Input
∙ The competition provides a dataset of textual requests for pizza
from the Reddit community Random Acts of Pizza
∙ For each sample of the dataset there is a good deal of information
concerning both the request and the requester
Goal
∙ Given a post (or request), the goal is to predict whether it will be
successful or unsuccessful
machine learning and sentiment extraction
machine learning approach
∙ We decided to adopt a machine learning approach to tackle
the challenge
∙ Figure 1 shows the workflow that describes the phases of this
work
Figure 1: workflow
dataset and features description
∙ The dataset contains 5671 textual requests for pizza
∙ Each sample of the dataset contains several pieces of information:
∙ about the text and the title of the request
∙ about the post of the request (number of comments, number of upvotes,
etc.)
∙ about the user who made the request (account age, publication date, etc.)
∙ a field that says whether the request was satisfied (pizza bought) or not
(so we used supervised learning algorithms)
∙ The dataset was in JSON format. We used Python to extract
information.
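The extraction step can be sketched as follows. This is only a minimal illustration, not the authors' script: the field names `request_title`, `request_text` and `requester_received_pizza` do not appear in the slides and are assumed here, and the inline `SAMPLE` string stands in for the real JSON file.

```python
import json

# Tiny stand-in for the competition file (assumption: the real file is a
# JSON array with one object per request).
SAMPLE = '''[
  {"request_title": "[Request] Hungry student",
   "request_text": "I just want a pizza",
   "requester_account_age_in_days_at_request": 120.5,
   "requester_received_pizza": false}
]'''

def load_requests(raw_json):
    """Parse the JSON text and keep a few fields per request."""
    rows = []
    for record in json.loads(raw_json):
        rows.append({
            # Assumed names for the two textual fields.
            "title": record.get("request_title", ""),
            "text": record.get("request_text", ""),
            # Numeric feature listed in the slides.
            "account_age": record["requester_account_age_in_days_at_request"],
            # Assumed name for the success field; stored as 0/1.
            "label": int(record["requester_received_pizza"]),
        })
    return rows

requests = load_requests(SAMPLE)
```

Storing the label as 0/1 keeps it directly usable by the supervised learning algorithms mentioned above.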
dataset: features about the post
∙ "number_of_downvotes_of_request_at_retrieval"
∙ "number_of_upvotes_of_request_at_retrieval"
∙ "request_number_of_comments_at_retrieval"
∙ "unix_timestamp_of_request_utc"
dataset: features about the requester
∙ "requester_account_age_in_days_at_request"
∙ "requester_account_age_in_days_at_retrieval"
∙ "requester_days_since_first_post_on_raop_at_request"
∙ "requester_number_of_comments_at_request"
∙ "requester_number_of_comments_at_retrieval"
∙ "requester_number_of_comments_in_raop_at_request"
∙ "requester_number_of_comments_in_raop_at_retrieval"
∙ "requester_number_of_posts_at_request"
∙ "requester_number_of_posts_at_retrieval"
∙ "requester_number_of_posts_on_raop_at_request"
∙ "requester_number_of_posts_on_raop_at_retrieval"
∙ "requester_number_of_subreddits_at_request"
∙ "requester_subreddits_at_request"
∙ "requester_upvotes_minus_downvotes_at_request"
∙ "requester_upvotes_minus_downvotes_at_retrieval"
∙ "requester_upvotes_plus_downvotes_at_request"
∙ "requester_upvotes_plus_downvotes_at_retrieval"
extracting information from title and text of requests
Textual features
∙ For each request the most important fields are textual: title and
request text
∙ The features in the previous slides were almost all in numeric
format
∙ They can be used for computation after a simple
preprocessing phase
∙ Textual features are a different story
Goal
Convert the textual features into numeric features that contain
sentiment information and are suitable as input to a machine
learning algorithm
sentiment extraction from text
Textual features
∙ Text of the request
∙ Title of the request
To convert the text to computable features, we calculate two
measures:
∙ Sentiment compression: captures the sentiment of the text
∙ Success frequency rate: captures the rate of success of the
text
sentiment compression: nltk polarity
We used NLTK’s API to get the polarity of a text
What NLTK returns
∙ Given a text, NLTK returns three polarity values: positivity,
negativity, neutrality
∙ If the value of the neutral sentiment is greater than 0.5, the text is
labelled as neutral
∙ Otherwise it is labelled with the greater of positivity and negativity,
whose values are complementary (their sum must be 1)
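The labelling rule above can be written as a small function. This is a sketch of the rule as stated on the slide, not a call into any NLTK API; the function name and the tie-breaking choice (positive wins when the two values are equal) are assumptions.

```python
def polarity_label(p_pos, p_neg, p_neu):
    """Return 'neutral', 'positive' or 'negative' from three polarity values.

    p_pos and p_neg are expected to sum to 1; p_neu is scored separately.
    """
    if p_neu > 0.5:
        return "neutral"
    # Assumption: ties between positivity and negativity go to 'positive'.
    return "positive" if p_pos >= p_neg else "negative"
```

For instance, a text with positivity 0.7, negativity 0.3 and neutrality 0.6 is labelled neutral even though its positivity is high, which is why a single compressed value has to keep both pieces of information.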
sentiment compression: sclabel
∙ We compressed the three values returned by NLTK into a single
value
∙ Let pPos and pNeu be the NLTK values associated (respectively) with
the positive and neutral sentiment
\[ \mathrm{SClabel} = \mathrm{pPos} \cdot \operatorname{sign}(0.5 - \mathrm{pNeu}) \tag{1} \]
∙ where the sign function is defined as:
\[ \operatorname{sign}(x) = \begin{cases} -1 & \text{if } x \le 0, \\ 1 & \text{if } x > 0. \end{cases} \]
∙ A single value thus keeps the information about both the positivity and the
polarity
sentiment compression: an example
An SClabel of -0.7 means that:
∙ the text is neutral, because the sign is negative
∙ its positivity is 0.7 (so the negativity is 0.3)
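A minimal sketch of equation (1) and of how a value is read back, assuming the inputs are the positivity and neutrality values described earlier. The helper `decode` is hypothetical and only mirrors the example above.

```python
def sign(x):
    """Sign as defined on the slide: -1 for x <= 0, 1 for x > 0."""
    return 1 if x > 0 else -1

def sclabel(p_pos, p_neu):
    """Compress positivity and neutrality into one signed value (eq. 1)."""
    return p_pos * sign(0.5 - p_neu)

def decode(sc):
    """Hypothetical helper: recover the information packed into an SClabel."""
    positivity = abs(sc)
    return {
        "neutral": sc < 0,             # a negative sign marks a neutral text
        "positivity": positivity,
        "negativity": 1 - positivity,  # positivity and negativity sum to 1
    }
```

One limitation of this encoding is worth noting: when pPos is exactly 0 the product is 0 and the sign, and with it the neutral flag, can no longer be recovered.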
success frequency rate
∙ We extract a new feature to estimate the success rate of a post
∙ We built a Bag of Words containing the most frequent words
which appear in the successful requests
∙ For each word we keep track of how many times it has
appeared
∙ We then extract the success frequency rate from
a text in this way
\[ \mathrm{succFrequency} = \frac{\sum \left( \mathrm{frequencyWordInText} \cdot \mathrm{frequencyWordInBag} \right)}{\mathrm{lengthText}} \tag{2} \]
where the sum runs over the distinct words of the text and lengthText is the number of words in the text
success frequency rate: an example
Given the text home sweet home, the success frequency rate is
calculated as:
\[ \mathrm{succFrequency} = \frac{2 \cdot \mathrm{frequencyWordInBag}(\text{home}) + 1 \cdot \mathrm{frequencyWordInBag}(\text{sweet})}{3} \]
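The computation can be sketched as below. Several details are assumptions, since the slides do not state them: lower-casing and whitespace tokenization, a contribution of 0 for words missing from the bag, and the hypothetical `top_n` cut-off used to keep only the most frequent words.

```python
from collections import Counter

def build_bag(successful_texts, top_n=100):
    """Count words over the successful requests and keep the top_n most frequent."""
    counts = Counter(
        word for text in successful_texts for word in text.lower().split()
    )
    return dict(counts.most_common(top_n))

def succ_frequency(text, bag):
    """Success frequency rate of a text (eq. 2)."""
    words = text.lower().split()
    if not words:
        return 0.0  # assumption: an empty text scores 0
    in_text = Counter(words)
    # Sum over distinct words: occurrences in the text times count in the bag.
    weighted = sum(n * bag.get(word, 0) for word, n in in_text.items())
    return weighted / len(words)
```

With a made-up bag `{"home": 10, "sweet": 4}`, the text home sweet home gives (2 · 10 + 1 · 4) / 3 = 8.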
data matrix composition
Data matrix
We thus obtained a [5671 x 25] matrix, where rows represent
samples (or requests) and columns represent features
Features selected
∙ 4 about the post (described previously)
∙ 17 about the requester (described previously)
∙ 2 about the sentiment of requests (SClabel[title] and SClabel[text])
∙ 2 about the success frequency rate of requests
(SuccFrequency[title] and SuccFrequency[text])
dataset 2d visualization
Figure 2: 2D dataset visualization with MDS
dataset 3d visualization
Figure 3: 3D dataset visualization with MDS
preprocessing
Normalization
To standardize the range of the feature values we used the
formula:
\[ X_{\mathrm{new}} = \frac{X - \mu}{\sigma} \tag{3} \]
where X is a column of the data matrix, µ is its mean and σ is its
standard deviation
Outliers
∙ We consider as outliers the values which differ from the mean by
more than 5 standard deviations
∙ We removed those values
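Both preprocessing steps can be sketched with the standard library alone. Two points are assumptions: the population standard deviation is used, and a whole sample (row) is dropped as soon as one of its values is an outlier, which is consistent with the smaller row count reported after preprocessing but is not spelled out in the slides.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Apply (X - mu) / sigma to every column of a list-of-lists matrix."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Assumption: a constant column (sigma = 0) is left unscaled.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

def drop_outlier_rows(z_rows, threshold=5.0):
    """Keep only rows whose standardized values all lie within the threshold."""
    return [row for row in z_rows if all(abs(z) <= threshold for z in row)]
```

After standardization each value is already expressed in standard deviations from the mean, so the 5σ rule reduces to comparing absolute values with 5.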
training set and test set
Data matrix after preprocessing
We obtained a [5548 x 25] matrix, where rows represent
samples (or requests) and columns represent features
Training and test set
We divided the data with random sampling without replacement as
follows:
∙ training set [3884 x 25] ≈ 70% of the data
∙ test set [1664 x 25] ≈ 30% of the data
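A minimal sketch of such a split, assuming a simple shuffle of the row indices; the function name, the fixed seed and the rounding of the training size are illustrative choices, not taken from the slides.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into two disjoint parts (no row is used twice)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Integers stand in for the 5548 preprocessed samples.
train, test = train_test_split(list(range(5548)))
```

With 5548 samples and a 0.7 fraction this yields 3884 training rows and 1664 test rows, matching the sizes above.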
classification models
After obtaining the features, we trained (on the training set)
several classification models:
∙ Support vector machine
∙ Linear Kernel
∙ Gaussian Kernel
∙ Polynomial Kernel
∙ Spline Kernel
∙ Random forest
∙ k-nearest neighbors
∙ k values used = 1, 5, 15, 25, 51
∙ Naive Bayes
We tested each model (on the test set) in order to evaluate its
performance
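As an illustration of one of the listed models, here is a plain k-nearest-neighbors classifier. It is a sketch under stated assumptions (Euclidean distance, majority vote), not the implementation used for the results; the slides do not say which library or distance was used.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

All the k values listed above are odd, so with two classes (successful or unsuccessful) the vote can never end in a tie.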
results
results classifiers
Figure 4: Accuracy, precision and recall for each classifier. SVM (linear
kernel) and Random Forest gave the best performance.
accuracy
Figure 5: Accuracy of classifiers. The best performance was obtained with
Random Forest and SVM (linear kernel)
precision
Figure 6: Precision of classifiers. The best performance was obtained with
Random Forest and SVM (linear kernel)
recall
Figure 7: Recall of classifiers. The best performance was obtained with SVM
(linear kernel), followed by Random Forest
conclusions and future work
Performance
∙ Overall, SVM and Random forest are the best
models for this dataset
∙ The best performance was obtained with Random forest:
∙ Accuracy ≈ 0.86
∙ Precision ≈ 0.83
∙ Recall ≈ 0.50
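For reference, the three measures reported above can be computed from true and predicted labels as in the sketch below. The labels in the usage lines are invented for illustration and are not the experiment's data; 1 stands for a successful request.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = successful."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted successes
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted success, none given
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed successes
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Invented labels, for illustration only.
accuracy, precision, recall = metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

Read this way, a recall of about 0.50 means that roughly half of the requests that actually received a pizza were predicted as successful, while a precision of about 0.83 means that most predicted successes were correct.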
Future work
∙ Try to make the classes more separable, for example by introducing noise in
the feature space
∙ Also consider synonyms in the bag of words before calculating the
success frequency rate
Thank you for your attention!
