Deceptive spam


As part of my Information Retrieval module at the University of East Anglia, we had to build a classifier to detect deceptive review spam. Nitin Jindal described review spam as follows: "It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products ... Unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions". This is my poster presentation, for which I implemented three classification algorithms in Python, as well as feature selection and preprocessor modules.


Review Spam Classification
Tarek Amr – University of East Anglia

Introduction
- Detecting Review Spam
- Classification Algorithms:
  • Naive Bayes
    • Multinomial
    • Multivariate (Bernoulli)
  • Rocchio (Cosine/Euclidean)
  • K-Nearest Neighbour (Cosine/Euclidean)
- Preprocessors / Feature Selection
  • N-gram Tokenizer
  • Stemming* (Porter/Lancaster)
  • Part of Speech Tagger*
  • Pruning of infrequent words
  • Mutual Information**
- Results Evaluation
  • Accuracy
  • Precision / Recall
  • F-Score (α = 1/2 => F = 2PR/(P+R))
* NLTK package was used
** Stand-alone

Feature Selection
[Joachims-1996] listed 3 steps for feature selection:
- Pruning of infrequent words. (3+ times)
- Pruning of highly frequent words. (Stop words)
- Choosing words with high Mutual Information.

Mutual Information
- [Table: top-10 terms with highest MI]
- Similar to the [Ott-2011] findings using LIWC
- Almost the same term rank with the Porter stemmer.
- Rocchio just went from 78.25% to 78.5% with the Porter stemmer (p >> 0.05)
- Somehow, the ranks of bi-grams and tri-grams didn't change a lot from the uni-grams: "michigan ave" vs "michigan", "the floor" vs "floor", "husband and" and "my husband" vs "husband", etc.
- Removing stop words!?
- Rocchio results for unigrams (78.25%), bigrams (81.125%) and trigrams (78.625%) [p = 0.178]
- We also agreed with [Rayson-2001] and [Ott-2011] regarding (Truthful) PoS tags

Naive Bayes (Pruning of infrequent words)
- Multivariate: ↑ Accuracy (87.63% => 87.88%)
  • Not statistically significant. (p = 0.58 >> 0.05)
  • Same for Precision and Recall
- Multinomial: ↓ Accuracy (88.5% => 87.88%)

Rocchio (Pruning of infrequent words)
- Steady for frequency thresholds below 7, then degradation
- My interpretation (Scientific!?)
  • Truncating *shallow* axes in Vector Space!
  • Centroid already not able to move much there.

Results
Average Accuracy:
- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- KNN [Cosine, Min. Freq=3, k=153] = 76.375%
Naive Bayes MV has slightly better recall than MN (0.92 @ p=0.18), while MN has slightly better precision (0.88 @ p=0.012). They both are much more precise than Rocchio (p < 0.01), and have better recall too (p < 0.05). However, as we have seen earlier, Rocchio excels when trained on fewer data.

Size of Dataset
- According to [Joachims-1996], Rocchio excels when training data is smaller.
- However, its improvement does not increase at the same rate as Naive Bayes.
- We trained Rocchio (Cosine distance) and Naive Bayes (MV) on subsets of our data, and plotted the results.
- [Plot: results for Rocchio and Naive Bayes (MV) on subsets of the data]

K-Nearest Neighbor
- We got the best result (Accuracy = 73.875%) when k was set to 105.
- Notice: We set k = k - 1 if k is an even number.
- Notice how accuracy goes to 50% when k = number of documents (we have an equal number of Truthful and Deceptive documents)

k-NN Continued
- As stated by [Han-2000]: “A major drawback of the similarity measure used in k-NN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the words is useful for classification”.
- Below you can see the classification accuracy in percentage for different values of k, using different features
- [Plot: classification accuracy (%) against k, for different features]

Twitter dataset (@AppleNws and @NokiaUS)
- 5 folds x 40 tweets
- Apple is a bot.
- Nokia used You, Your and RT more
- Nokia uses more personal pronouns, whereas Apple uses more Hashtags
- NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%)

Conclusion
- Statistical nature of text varies from one dataset to the other, and results vary accordingly.
- Naive Bayes outperformed TFIDF Algorithms.
- With fewer data, Rocchio outperforms NB.
- kNN is resource intensive, especially in testing.
- Feature selection is more suitable for both Naive Bayes MV and kNN.
- Mutual Information helps in visualizing our data, let alone its use for Feature Selection.
- Would be better to try combining MI into our Classifiers and check results accordingly.
- Stemming and n-grams did not offer any significant improvement, due to the nature of the top informative terms.
- Our results for PoS using Rocchio and NB were far away from SVM/PoS results

References
- A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. [Joachims-1996]
- Centroid-based document classification: Analysis and experimental results. [Han-2000]
- Grammatical word class variation within the British National Corpus Sampler. [Rayson-2001]
- Finding deceptive opinion spam by any stretch of the imagination. [Ott-2011]
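
The poster names Mutual Information as a stand-alone feature selection step but does not show how it was computed. The sketch below is one common way to do it, assuming the textbook definition of mutual information between "term occurs in the document" and "document belongs to a class". The function names (`mutual_information`, `top_terms`), the tokenized input format, and the four toy reviews are all invented for illustration; they are not taken from the poster's code or dataset.

```python
import math
from collections import Counter


def mutual_information(docs, labels, target):
    """Score each term by the mutual information (in bits) between its
    presence in a document and the document having the label `target`.

    docs   -- list of token lists (hypothetical input format)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    n_target = sum(1 for label in labels if label == target)
    df = Counter()         # number of documents containing the term
    df_target = Counter()  # ... that are also labelled `target`
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # presence only, not term frequency
            df[term] += 1
            if label == target:
                df_target[term] += 1

    scores = {}
    for term in df:
        n11 = df_target[term]      # term present, target class
        n10 = df[term] - n11       # term present, other class
        n01 = n_target - n11       # term absent, target class
        n00 = n - n11 - n10 - n01  # term absent, other class
        mi = 0.0
        # Each tuple is (joint count, term marginal, class marginal).
        for n_joint, n_term, n_class in (
            (n11, n11 + n10, n11 + n01),
            (n10, n11 + n10, n10 + n00),
            (n01, n01 + n00, n11 + n01),
            (n00, n01 + n00, n10 + n00),
        ):
            if n_joint:  # a zero joint count contributes nothing
                mi += (n_joint / n) * math.log2(n * n_joint / (n_term * n_class))
        scores[term] = mi
    return scores


def top_terms(scores, k):
    """Return the k terms with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, invented for this example only.
docs = [
    ["my", "husband", "loved", "chicago"],
    ["great", "location", "my", "husband"],
    ["clean", "room", "small", "bathroom"],
    ["great", "location", "clean", "room"],
]
labels = ["deceptive", "deceptive", "truthful", "truthful"]

scores = mutual_information(docs, labels, target="deceptive")
print(top_terms(scores, 4))
```

On this toy data, a term that appears in every document of one class and in none of the other (such as "husband" or "clean") gets the maximum score of 1 bit, while a term spread evenly over both classes (such as "great") scores 0. The score is symmetric in the two classes, so it indicates how informative a term is, not which class it points to.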