Deceptive Spam
1. Review Spam Classification
Tarek Amr – University of East Anglia
Introduction
Mutual Information
- Top-10 terms with highest MI (table not reproduced here)

Size of Dataset
- According to [Joachims-1996], Rocchio excels when the training data is small.
- However, its accuracy does not improve at the same rate as Naive Bayes as the training set grows.
- We trained Rocchio (cosine distance) and Naive Bayes (multivariate) on subsets of our data and plotted the results (plot not reproduced here).
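The training-size behaviour comes from Rocchio's reliance on class centroids. A minimal sketch of centroid classification with cosine similarity; the term-weight vectors below are toy stand-ins for the real term weights, not our actual features:

```python
# Rocchio sketch (toy data): each class is represented by the centroid of its
# training vectors; a document is assigned to the class whose centroid has the
# highest cosine similarity.
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    c = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            c[term] += w / len(vectors)
    return dict(c)

def rocchio_predict(centroids, doc_vec):
    return max(centroids, key=lambda label: cosine(centroids[label], doc_vec))

# Made-up term-weight vectors, standing in for weighted review terms
train = {
    "truthful": [{"floor": 1.0, "husband": 1.0}, {"husband": 1.0, "ave": 1.0}],
    "deceptive": [{"luxury": 1.0, "amazing": 1.0}, {"amazing": 1.0, "stay": 1.0}],
}
cents = {label: centroid(vecs) for label, vecs in train.items()}
print(rocchio_predict(cents, {"husband": 1.0, "floor": 1.0}))  # → truthful
```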
k-NN Continued
- As stated by [Han-2000]:
  “A major drawback of the similarity measure used in k-NN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the words is useful for classification”.
- Classification accuracy (%) for different values of k, using different features (table not reproduced here)
References
[Joachims-1996] A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.
[Han-2000] Centroid-based document classification: Analysis and experimental results.
[Rayson-2001] Grammatical word class variation within the British National Corpus Sampler.
[Ott-2011] Finding deceptive opinion spam by any stretch of the imagination.
- Detecting Review Spam
- Classification Algorithms:
  • Naive Bayes
    • Multinomial
    • Multivariate (Bernoulli)
  • Rocchio (Cosine/Euclidean distance)
  • k-Nearest Neighbour (Cosine/Euclidean distance)
- Preprocessors / Feature Selection:
  • N-gram Tokenizer
  • Stemming* (Porter/Lancaster)
  • Part-of-Speech Tagger*
  • Pruning of infrequent words
  • Mutual Information**
- Results Evaluation:
  • Accuracy
  • Precision / Recall
  • F-Score (α = 1/2 ⇒ 2PR/(P+R))

* The NLTK package was used.
** Used stand-alone (not combined into the classifiers).
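The F-score above is the weighted harmonic mean of precision and recall; with α = 1/2 it reduces to the familiar F1 = 2PR/(P+R). A small sketch, with made-up precision/recall values:

```python
# F-score as the weighted harmonic mean: F = 1 / (a/P + (1-a)/R).
# With a = 1/2 this reduces to F1 = 2PR / (P + R).
def f_score(precision, recall, a=0.5):
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (a / precision + (1.0 - a) / recall)

p, r = 0.88, 0.92   # illustrative values, not measured results
print(round(f_score(p, r), 4))  # F1 of the two values above
```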
Feature Selection
[Joachims-1996] lists three steps for feature selection:
- Pruning of infrequent words (keep only words occurring 3+ times)
- Pruning of highly frequent words (stop words)
- Choosing the words with the highest Mutual Information
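The three steps might look like this on toy data; the documents, stop list, and frequency threshold below are all invented for illustration (the slides pruned words occurring fewer than 3 times):

```python
# Feature-selection sketch: prune rare words, drop stop words, then rank the
# remaining vocabulary by mutual information between term presence and class.
import math
from collections import Counter

docs = [
    (["my", "husband", "loved", "the", "floor"], "truthful"),
    (["the", "floor", "was", "clean"], "truthful"),
    (["amazing", "luxury", "hotel", "the", "best"], "deceptive"),
    (["the", "most", "amazing", "stay"], "deceptive"),
]
stop_words = {"the", "my", "was", "most"}
min_freq = 2  # toy threshold; the slides used 3+

# Document frequency; keep words that are frequent enough and not stop words
freq = Counter(t for terms, _ in docs for t in set(terms))
vocab = [t for t, n in freq.items() if n >= min_freq and t not in stop_words]

def mutual_information(term):
    # I(T; C) over term presence/absence and the two class labels
    n = len(docs)
    mi = 0.0
    for t_val in (True, False):
        for cls in ("truthful", "deceptive"):
            joint = sum(1 for terms, c in docs
                        if (term in terms) == t_val and c == cls) / n
            p_t = sum(1 for terms, _ in docs if (term in terms) == t_val) / n
            p_c = sum(1 for _, c in docs if c == cls) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_t * p_c))
    return mi

ranked = sorted(vocab, key=mutual_information, reverse=True)
print(ranked)
```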
Naive Bayes (Pruning of infrequent words)
- Multivariate: ↑ accuracy (87.63% => 87.88%)
  - Not statistically significant (p = 0.58 >> 0.05)
  - Same for precision and recall
- Multinomial: ↓ accuracy (88.5% => 87.88%)
Rocchio (Pruning of infrequent words)
- Accuracy is steady while the frequency threshold is below 7, then it degrades
- My interpretation (scientific!?):
  - Pruning truncates *shallow* axes of the vector space
  - The centroid was already unable to move much along those axes

Twitter Dataset
- Twitter dataset (@AppleNws and @NokiaUS)
- 5 folds x 40 tweets
- Apple is a bot.
- Nokia used 'You', 'Your' and 'RT' more
- Nokia uses more personal pronouns, whereas Apple uses more hashtags
- NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%)
- Similar to the findings of [Ott-2011] using LIWC

Stemming and N-grams
- Almost the same term ranking with the Porter stemmer
- Rocchio only went from 78.25% to 78.5% with the Porter stemmer (p >> 0.05)
- Somehow, the bigram and trigram rankings did not change much from the unigrams:
  'michigan ave' vs 'michigan', 'the floor' vs 'floor', 'husband and' and 'my husband' vs 'husband', etc.
- Removing stop words!?
- Rocchio results for unigrams (78.25%), bigrams (81.125%) and trigrams (78.625%) [p = 0.178]
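A tokenizer sketch for the n-grams discussed above; the example sentence is made up but echoes the review terms:

```python
# Word n-gram tokenizer: slide a window of size n over the token list.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "my husband loved the floor".split()
print(ngrams(tokens, 2))
# → ['my husband', 'husband loved', 'loved the', 'the floor']
```

Since informative bigrams such as 'my husband' carry nearly the same signal as the unigram 'husband', ranking by n-grams changed little.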
- We also agreed with [Rayson-2001] and [Ott-2011] regarding the PoS tags of (truthful) reviews

K-Nearest Neighbor
- We got the best result (accuracy = 73.875%) when k was set to 105
- Note: we set k = k − 1 if k is an even number
- Notice how accuracy drops to 50% when k = the number of documents (we have equal numbers of truthful and deceptive documents)
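The even-k adjustment can be sketched as follows; with two balanced classes it prevents tied majority votes. The vectors and labels below are toy stand-ins, not the actual review features:

```python
# Cosine k-NN sketch with the slides' rule: k -> k - 1 when k is even, so a
# majority vote between two balanced classes cannot tie.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, doc, k):
    if k % 2 == 0:  # make k odd, as on the slides
        k -= 1
    # testing is expensive: every test document is compared to all training docs
    neighbours = sorted(train, key=lambda item: cosine(item[0], doc),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [
    ({"husband": 1.0, "floor": 1.0}, "truthful"),
    ({"floor": 1.0, "clean": 1.0}, "truthful"),
    ({"amazing": 1.0, "luxury": 1.0}, "deceptive"),
    ({"amazing": 1.0, "stay": 1.0}, "deceptive"),
]
print(knn_predict(train, {"floor": 1.0}, k=4))  # k becomes 3 → truthful
```

The comparison against every training document at test time is also why kNN is the most resource-intensive of the classifiers here.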
Results
Average accuracy:
- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- kNN [Cosine, Min. Freq = 3, k = 153] = 76.375%
Naive Bayes MV has slightly better recall than MN (0.92, p = 0.18), while MN has slightly better precision (0.88, p = 0.012).
Both are much more precise than Rocchio (p < 0.01), and have better recall too (p < 0.05).
However, as we saw earlier, Rocchio excels when trained on less data.
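The p-values quoted throughout compare per-fold scores across the 5 folds. One common way to obtain such a p-value is a paired t-test over fold accuracies; the slides do not state which test was used, and the fold scores below are invented for illustration:

```python
# Paired t-test sketch over per-fold accuracies of two classifiers.
# The resulting t statistic is compared against a t-table with n-1 degrees
# of freedom to get a p-value.
import math

def paired_t(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

nb_folds      = [0.89, 0.87, 0.88, 0.90, 0.885]   # made-up fold accuracies
rocchio_folds = [0.79, 0.77, 0.78, 0.80, 0.775]
t = paired_t(nb_folds, rocchio_folds)
print(round(t, 2))  # look up against n-1 = 4 degrees of freedom
```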
Conclusion
- The statistical nature of text varies from one dataset to another, and results vary accordingly.
- Naive Bayes outperformed the TFIDF-based algorithms (Rocchio and kNN).
- With less data, Rocchio outperforms Naive Bayes.
- kNN is resource-intensive, especially at test time.
- Feature selection is more suitable for both Naive Bayes MV and kNN.
- Mutual Information helps in visualizing our data, let alone its use for feature selection.
- It would be better to try combining MI into our classifiers and check the results accordingly.
- Stemming and n-grams did not offer any significant improvement, due to the nature of the top informative terms.
- Our results for PoS using Rocchio and NB were far from the published SVM/PoS results.