Java application for text classification in a social context.
The aim of this study is to develop an application that can retrieve short texts and classify them (using WEKA) as either positive or negative, depending on the emotion of the writer. More specifically, the texts under analysis are tweets retrieved directly from Twitter.
2. Motivation
Sentiment analysis
Classification of the polarity of a given text at the document, sentence or phrase level
Goal: determine whether the expressed opinion is positive or negative
Twitter
Microblogging tool; its short sentences are less ambiguous
Variable audience
Application domains
Stock market
Product opinions
Political elections
4. The corpus
Two datasets:
STS (Stanford Twitter Sentiment) corpus: hand-labelled, covering different subjects
A second, auto-generated corpus: 40,000 labelled, balanced tweets from 2010, using smileys as labels (its size constrained by Twitter request rate limits)
5. Preprocessing
Remove retweets (RTs)
Keep only English tweets
Remove URLs, mentions and numbers
Collapse repeated characters
Replace emoticons with their polarity (auto-generated database)
Example: Have you heard about TEDx speech? So great! by @yulia Sooo in #Milan https://www.ted.com/talks/insightful_human_portraits_made_from_data
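A minimal Java sketch of these cleaning steps, using only java.util.regex; the emoticon map is a hypothetical stand-in for the auto-generated emoticon database mentioned above, and retweet removal and language filtering are omitted:

    import java.util.Map;
    import java.util.regex.Pattern;

    public class TweetPreprocessor {

        // Hypothetical stand-in for the auto-generated emoticon-polarity database.
        private static final Map<String, String> EMOTICONS = Map.of(
                ":)", " positive_emoticon ",
                ":D", " positive_emoticon ",
                ":(", " negative_emoticon ");

        private static final Pattern URL     = Pattern.compile("https?://\\S+");
        private static final Pattern MENTION = Pattern.compile("@\\w+");
        private static final Pattern NUMBER  = Pattern.compile("\\b\\d+\\b");
        // Three or more repetitions of a character collapse to two ("Sooo" -> "Soo").
        private static final Pattern REPEAT  = Pattern.compile("(.)\\1{2,}");

        public static String clean(String tweet) {
            String t = tweet;
            for (Map.Entry<String, String> e : EMOTICONS.entrySet()) {
                t = t.replace(e.getKey(), e.getValue());   // emoticon -> polarity token
            }
            t = URL.matcher(t).replaceAll(" ");            // remove URLs
            t = MENTION.matcher(t).replaceAll(" ");        // remove @mentions
            t = NUMBER.matcher(t).replaceAll(" ");         // remove numbers
            t = REPEAT.matcher(t).replaceAll("$1$1");      // collapse repeated chars
            return t.replaceAll("\\s+", " ").trim();
        }

        public static void main(String[] args) {
            System.out.println(clean(
                "Have you heard about TEDx speech? So great! by @yulia Sooo in #Milan "
                + "https://www.ted.com/talks/insightful_human_portraits_made_from_data"));
        }
    }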
A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others given some class. More formally, if I want to calculate the probability of observing features $f_1$ through $f_n$, given some class $c$, under the Naive Bayes assumption the following holds:
$$p(f_1, \ldots, f_n \mid c) = \prod_{i=1}^{n} p(f_i \mid c)$$
This means that when I want to use a Naive Bayes model to classify a new example, the posterior probability is much simpler to work with:
$$p(c \mid f_1, \ldots, f_n) \propto p(c)\, p(f_1 \mid c) \cdots p(f_n \mid c)$$
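For concreteness, a toy calculation with made-up numbers: take two observed features, the words "great" and "bad", equal priors $p(\text{pos}) = p(\text{neg}) = 0.5$, and assumed likelihoods $p(\text{great} \mid \text{pos}) = 0.10$, $p(\text{bad} \mid \text{pos}) = 0.01$, $p(\text{great} \mid \text{neg}) = 0.02$, $p(\text{bad} \mid \text{neg}) = 0.08$. Then

$$p(\text{pos} \mid \text{great}, \text{bad}) \propto 0.5 \times 0.10 \times 0.01 = 5 \times 10^{-4}$$
$$p(\text{neg} \mid \text{great}, \text{bad}) \propto 0.5 \times 0.02 \times 0.08 = 8 \times 10^{-4}$$

so the tweet would be classified as negative.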
Of course these assumptions of independence are rarely true, which may explain why some have referred to the model as the "Idiot Bayes" model, but in practice Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
Up to this point we have said nothing about the distribution of each feature. In other words, we have left $p(f_i \mid c)$ undefined. The term Multinomial Naive Bayes (the multinomial distribution generalizes the binomial, which usually counts a single variable and hence two possible outcomes, to several variables each with its own probability, i.e. a multinomial distribution for each feature) simply lets us know that each $p(f_i \mid c)$ is a multinomial distribution, rather than some other distribution. This works well for data which can easily be turned into counts, such as word counts in text.
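A minimal sketch of how this might look with WEKA's NaiveBayesMultinomial, assuming WEKA is on the classpath; the tiny in-memory training set and tweet texts are made up for illustration, and StringToWordVector is set to output word counts, which is what the multinomial model expects:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instances;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TweetClassifier {

        public static void main(String[] args) throws Exception {
            // One string attribute for the tweet text, one nominal class attribute.
            ArrayList<Attribute> attrs = new ArrayList<>();
            attrs.add(new Attribute("text", (List<String>) null));
            attrs.add(new Attribute("polarity", Arrays.asList("positive", "negative")));

            Instances train = new Instances("tweets", attrs, 0);
            train.setClassIndex(1);

            // Tiny made-up training set; the real corpus would be loaded instead.
            addTweet(train, "so great loved the talk", "positive");
            addTweet(train, "awful day everything went wrong", "negative");

            // Bag-of-words with counts (not just presence/absence),
            // feeding the multinomial model through a FilteredClassifier.
            StringToWordVector bow = new StringToWordVector();
            bow.setOutputWordCounts(true);

            FilteredClassifier classifier = new FilteredClassifier();
            classifier.setFilter(bow);
            classifier.setClassifier(new NaiveBayesMultinomial());
            classifier.buildClassifier(train);

            // Classify a new, unlabelled tweet.
            DenseInstance fresh = new DenseInstance(2);
            fresh.setDataset(train);
            fresh.setValue(train.attribute(0), "great great talk");
            fresh.setClassMissing();
            double label = classifier.classifyInstance(fresh);
            System.out.println(train.classAttribute().value((int) label));
        }

        private static void addTweet(Instances data, String text, String polarity) {
            DenseInstance inst = new DenseInstance(2);
            inst.setDataset(data);
            inst.setValue(data.attribute(0), text);
            inst.setValue(data.attribute(1), polarity);
            data.add(inst);
        }
    }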