Review Spam Classification
Tarek Amr – University of East Anglia
Size of Dataset
- According to [Joachims-1996], Rocchio excels
when training data is scarce.
- However, its accuracy does not improve at
the same rate as that of Naive Bayes.
- We trained Rocchio (cosine distance) and
Naive Bayes (MV) on subsets of our data and
plotted the results.
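A minimal sketch of a Rocchio-style centroid classifier with cosine similarity, to make the comparison concrete. It uses raw term counts rather than TFIDF weights, and the function names and the four toy reviews are illustrative assumptions, not the poster's code or data.

```python
import math
from collections import Counter, defaultdict

def tf_vector(text):
    """Bag-of-words vector of raw term counts (no IDF weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_rocchio(docs, labels):
    """Return one centroid (mean term vector) per class."""
    sums, counts = defaultdict(Counter), Counter(labels)
    for text, label in zip(docs, labels):
        sums[label].update(tf_vector(text))
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}

def predict_rocchio(centroids, text):
    """Assign the class whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy, made-up reviews purely for illustration.
docs = ["my husband loved the hotel",
        "the floor was clean and quiet",
        "great luxury experience amazing",
        "amazing luxury hotel experience"]
labels = ["truthful", "truthful", "deceptive", "deceptive"]

centroids = train_rocchio(docs, labels)
print(predict_rocchio(centroids, "my husband"))
print(predict_rocchio(centroids, "amazing luxury"))
```

Training on a subset, as in the experiment described above, amounts to passing a slice of `docs` and `labels` to `train_rocchio`.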
k-NN Continued
- As stated by [Han-2000]:
“A major drawback of the similarity measure
used in k-NN is that it uses all features
equally in computing similarities. This can
lead to poor similarity measures and
classification errors, when only a small
subset of the words is useful for
classification.”
- We measured classification accuracy (%)
for different values of k, using different
features.
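The drawback quoted from [Han-2000] can be illustrated with a small sketch: cosine similarity computed over all features versus over a selected subset. The `features` parameter and the two toy vectors are assumptions made for illustration only.

```python
import math

def cosine(u, v, features=None):
    """Cosine similarity of sparse count vectors.

    If `features` is given, both vectors are first restricted to it.
    """
    if features is not None:
        u = {t: w for t, w in u.items() if t in features}
        v = {t: w for t, w in v.items() if t in features}
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Two made-up documents that share only uninformative words.
a = {"the": 3, "hotel": 2, "husband": 1}
b = {"the": 3, "hotel": 2, "luxury": 1}

all_features = cosine(a, b)
selected = cosine(a, b, features={"husband", "luxury"})
print(all_features, selected)
```

With all features weighted equally the two documents look nearly identical, while on the selected terms they share nothing.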
References
[Joachims-1996] A Probabilistic Analysis of the Rocchio Algorithm
with TFIDF for Text Categorization.
[Han-2000] Centroid-based document classification: Analysis
and experimental results.
[Rayson-2001] Grammatical word class variation within the
British National Corpus Sampler.
[Ott-2011] Finding deceptive opinion spam by any stretch of
the imagination.
Introduction
- Detecting Review Spam
- Classification Algorithms:
  • Naive Bayes
    • Multinomial
    • Multivariate (Bernoulli)
  • Rocchio (Cosine/Euclidean)
  • K-Nearest Neighbour (Cosine/Euclidean)
- Preprocessors / Feature Selection
  • N-gram Tokenizer
  • Stemming* (Porter/Lancaster)
  • Part of Speech Tagger*
  • Pruning of infrequent words
  • Mutual Information**
- Results Evaluation
  • Accuracy
  • Precision / Recall
  • F-Score (α = 1/2 => 2PR/(P+R))
* NLTK package was used
** Stand-alone
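The evaluation measures listed above can be sketched as follows. The F-score is written in its weighted form F = 1 / (α/P + (1 - α)/R), which reduces to 2PR/(P+R) at α = 1/2. The function name and the toy label lists are hypothetical.

```python
def evaluate(y_true, y_pred, positive, alpha=0.5):
    """Return (accuracy, precision, recall, F) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        f_score = 0.0
    else:
        # Weighted harmonic mean of precision and recall.
        f_score = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return accuracy, precision, recall, f_score

# Made-up labels: "d" = deceptive, "t" = truthful.
y_true = ["d", "d", "d", "t", "t"]
y_pred = ["d", "d", "t", "d", "t"]
acc, p, r, f = evaluate(y_true, y_pred, positive="d")
print(acc, p, r, f)
```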
Feature Selection
[Joachims-1996] lists three steps for feature
selection:
- Pruning of infrequent words (keep words occurring 3+ times).
- Pruning of high-frequency words (stop words).
- Choosing words with high Mutual Information.
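The third step can be sketched as follows, assuming mutual information is computed between the presence of a term in a document and the class label. This is a common form in text categorization; the poster does not spell out its exact formula. `mutual_information` and the toy documents are hypothetical.

```python
import math

def mutual_information(docs, labels, term):
    """MI in bits between term presence and class label.

    I(T;C) = sum over t in {present, absent} and classes c of
             P(t,c) * log2( P(t,c) / (P(t) * P(c)) )
    `docs` is a list of token sets.
    """
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        n_term = sum((term in d) == present for d in docs)
        for c in set(labels):
            n_class = sum(l == c for l in labels)
            joint = sum((term in d) == present and l == c
                        for d, l in zip(docs, labels))
            if joint:  # empty cells contribute nothing
                mi += (joint / n) * math.log2(joint * n / (n_term * n_class))
    return mi

# Made-up token sets for illustration.
docs = [{"my", "husband"}, {"husband", "floor"},
        {"luxury", "hotel"}, {"luxury", "floor"}]
labels = ["truthful", "truthful", "deceptive", "deceptive"]

vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: mutual_information(docs, labels, t),
                reverse=True)
print(ranked[:10])
```

Here 'husband' and 'luxury' each separate the two classes perfectly and score 1 bit, while 'floor' appears equally in both and scores 0.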
Naive Bayes (Pruning of infrequent words)
- Multivariate: ↑ Accuracy (87.63% => 87.88%)
- Not statistically significant. (p = 0.58 >> 0.05)
- Same for Precision and Recall
- Multinomial: ↓ Accuracy (88.5% => 87.88%)
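Pruning of infrequent words can be sketched as below. This assumes the threshold applies to a word's total number of occurrences across the training documents; it could equally be applied to document frequency, and the poster does not say which was used. The function name and sample tokens are hypothetical.

```python
from collections import Counter

def prune_infrequent(tokenized_docs, min_freq=3):
    """Drop tokens that occur fewer than `min_freq` times in total."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {tok for tok, c in counts.items() if c >= min_freq}
    pruned = [[tok for tok in doc if tok in vocab] for doc in tokenized_docs]
    return pruned, vocab

# Made-up tokenized documents.
tokenized = [["hotel", "floor", "hotel"],
             ["hotel", "husband"],
             ["floor", "michigan"]]
pruned, vocab = prune_infrequent(tokenized, min_freq=3)
print(vocab)
print(pruned)
```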
Rocchio (Pruning of infrequent words)
- Accuracy is steady for minimum frequency < 7, then degrades
- My interpretation (scientific!?):
  - Pruning truncates *shallow* axes in the vector space.
  - The centroid could barely move along those axes anyway.
Mutual Information
Top-10 terms with highest MI (not reproduced here)
- Twitter dataset (@AppleNws and @NokiaUS)
- 5 folds x 40 tweets
- Apple is a bot.
- Nokia used 'You', 'Your' and 'RT' more
- Nokia uses more personal pronouns, whereas
Apple uses more Hashtags
- Accuracy: NB (97.47%), Rocchio (92.47%), NB/PoS (85.91%)
- Similar to [Ott-2011] findings using LIWC
- Term ranks are almost the same with the Porter stemmer.
- Rocchio only went from 78.25% to 78.5% with
the Porter stemmer (p >> 0.05)
- Somehow, the ranks of bigrams and trigrams did not
change much from those of unigrams:
'michigan ave' vs 'michigan', 'the floor' vs 'floor',
'husband and' and 'my husband' vs 'husband', etc.
- Removing stop words!?
- Rocchio results for unigrams (78.25%), bigrams
(81.125%) and trigrams (78.625%) [p = 0.178]
- Our findings also agree with [Rayson-2001] and
[Ott-2011] regarding (truthful) PoS tags
K-Nearest Neighbor
- We got the best result (accuracy = 73.875%) when
k was set to 105.
- Note: we set k = k - 1 if k is an even number.
- Note how accuracy drops to 50% when k =
number of documents (we have an equal number of
truthful and deceptive documents).
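A small sketch of the voting step described above, assuming neighbours have already been ranked by similarity. Forcing k to be odd avoids ties in a two-class vote. When k approaches the number of training documents, every test document sees almost the same neighbour set, so the vote no longer depends on the document and accuracy on balanced classes falls to about 50%. The helper names and the ranked list are hypothetical.

```python
from collections import Counter

def make_odd(k):
    """Use k - 1 when k is even, so a two-class vote cannot tie."""
    return k - 1 if k % 2 == 0 else k

def knn_vote(ranked_labels, k):
    """Majority label among the k nearest neighbours.

    `ranked_labels` holds training labels sorted from most to least
    similar to the test document.
    """
    top = ranked_labels[:make_odd(k)]
    return Counter(top).most_common(1)[0][0]

# Made-up ranking of six training labels for one test document.
ranked = ["truthful", "deceptive", "truthful",
          "deceptive", "deceptive", "truthful"]
print(knn_vote(ranked, 3))
print(knn_vote(ranked, 6))
```

With k = 6 (reduced to 5) the vote is decided only by which label happens to rank last, which illustrates why very large k carries little information about the test document.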
Results
Average Accuracy:
- Naive Bayes [Multivariate, Terms] = 87.625%
- Naive Bayes [Multinomial, Terms] = 88.5%
- Rocchio [Cosine, Terms] = 78.25%
- Rocchio [Cosine, Bigrams] = 81.125%
- kNN [Cosine, Min. Freq = 3, k = 153] = 76.375%
Naive Bayes MV has slightly better recall than
MN (0.92 @ p = 0.18), while MN has slightly better
precision (0.88 @ p = 0.012).
Both are much more precise than Rocchio
(p < 0.01), and have better recall too (p < 0.05).
However, as we have seen earlier, Rocchio
excels when trained on less data.
Conclusion
- The statistical nature of text varies from one dataset
to another, and results vary accordingly.
- Naive Bayes outperformed the TFIDF algorithms.
- With less data, Rocchio outperforms NB.
- kNN is resource-intensive, especially at test time.
- Feature selection is more suitable for both
Naive Bayes MV and kNN.
- Mutual Information helps visualize our data,
in addition to its use for feature selection.
- It would be worth combining MI into our
classifiers and checking the results.
- Stemming and n-grams did not offer any
significant improvement, due to the nature of the
top informative terms.
- Our PoS results using Rocchio and NB were
far from the SVM/PoS results.
