2. Outline
Introduction
Fake Review Identification
Motivation
Related Work
Research Question
Research Methodology
Results & Analysis
Conclusion and Future Work
3. Introduction
E-commerce website
An online platform for the sale and purchase of services and products
Examples of services: restaurants, beauty parlors, home cleaners
Examples of goods: vehicles, garments, electronic devices
4. Reviews
Also called an opinion or suggestion
User-generated content describing experience with a product or service
Consists of review content and a rating
5. Importance of User Reviews
Guide new customers
Beneficial for businesses
Can be positive or negative [Jindal et al. 2008]
6. Fake Review Detection
Fake/untruthful reviews mislead users and customers
Posted by spammers for the financial gain of a business
Can have positive or negative polarity
Types [Jindal et al. 2007]:
Untruthful reviews
Brand reviews
Non-reviews
7. Motivation
Untruthful reviews have far-reaching effects:
Influence user decisions
Mislead new customers
Affect business marketing strategies
Affect businesses financially
Erode trust in e-commerce websites
Goal: identify untruthful reviews by exploiting different features
8. Related Work
Fake Review Detection
[Jindal et al. 2007, Jindal et al. 2008, Algur et al. 2010, F. Li et al. 2011, Ott et al. 2011, Wu, Greene et al. 2010, Lai et al. 2010, H. Li et al. 2014, Lin et al. 2014, Ott et al. 2013, Mukherjee et al. 2013, D. Zhang et al. 2016]
Spammer Identification
[Wang et al. 2011, Akoglu et al. 2013, Fei et al. 2013]
Group Spammer Detection
[Liu et al. 2012, Mukherjee et al. 2011]
9. Related Work on Fake Review Detection
Year | Author | Dataset Type/Source | Classifier | Feature Type
2007 | Nitin Jindal et al. | Pseudo fake / Amazon | LR | Contextual
2008 | Nitin Jindal et al. | Pseudo fake / Amazon | LR | Contextual
2010 | C. Lai et al. | Pseudo fake / Amazon | SVM | Contextual
2010 | Siddu Algur et al. | Pseudo fake / Web pages | - | Contextual
2011 | Fangtao Li et al. | Pseudo fake / Epinions | LR, SVM, NB | Contextual
2013 | Arjun Mukherjee et al. | Real life / Yelp | SVM | Contextual, Behavioral
2014 | H. Li et al. | Real life / Dianping | SVM | Contextual, Behavioral
2014 | Yuming Lin et al. | Pseudo fake / Amazon | LR, SVM | Contextual, Behavioral
2016 | Istiaq Ahsan et al. | Pseudo fake + Real life / AMT + Yelp | NB, SVM | Contextual
2016 | Dongsong Zhang et al. | Real life / Yelp | SVM, DT, RF, NB | Contextual, Behavioral
10. Related Work on Fake Review Detection
Features
Contextual Features
Behavioral Features
Dataset
Pseudo Fake Review
Real-life Review
Classifiers
11. Contextual Features
Extracted from the content of the review [Li et al. 2011, Mukherjee et al. 2013, Algur et al. 2010, Zhang et al. 2016]
For example: review length, capital diversity (see the sketch below)
Behavioral Features
Represent the behavior of the reviewer and the review [Mukherjee et al. 2013, Algur et al. 2010, Zhang et al. 2016]
For example: average posting rate, positive ratio, review duration
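A minimal sketch of the two contextual features named above; the slides only name them, so the exact definitions below are assumptions for illustration:

```python
# Hypothetical definitions of two contextual features named on this slide;
# the real feature definitions may differ.
def review_length(text: str) -> int:
    """Number of word tokens in the review content."""
    return len(text.split())

def capital_diversity(text: str) -> float:
    """Assumed here: share of words starting with a capital letter."""
    words = text.split()
    return sum(w[0].isupper() for w in words) / len(words) if words else 0.0

print(review_length("Great food and Great staff"))      # 5
print(capital_diversity("Great food and Great staff"))  # 0.4
```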
12. Features
"Reviewer Content Similarity" [Zhang et al. 2016, Mukherjee et al. 2013]
A contextual feature
The average text similarity over all reviews posted by a reviewer
"Reviewer Deviation" [Mukherjee et al. 2013]
A behavioral feature
Captures the variation of a review's rating from the other ratings on a restaurant
13. Research Questions
RQ1: What is the effect of "Reviewer Deviation" when combined with other contextual and behavioral features to identify fake reviews on the Yelp dataset?
RQ2: What is the importance of "Reviewer Deviation" compared with other behavioral features when training a fake review detection model?
RQ3: What is the effect of different weighting schemes when calculating the "Reviewer Content Similarity" feature of a reviewer?
16. Preprocessing
Remove invalid values
Transform attribute values into the desired format
E.g. the "date" attribute of a review: convert the String into DATE format and remove the "Updated" keyword, i.e. "Updated – 01-08-2010" becomes "01-08-2010" (see the sketch below)
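A minimal sketch of this date-cleanup step, assuming a day-month-year format as in the slide's example:

```python
# Sketch of the date-cleanup step; the day-month-year format is an
# assumption based on the slide's example.
import re
from datetime import datetime

def clean_review_date(raw: str) -> datetime:
    """Strip an optional 'Updated -' prefix, then parse the date string."""
    cleaned = re.sub(r"^Updated\s*[-–]\s*", "", raw.strip())
    return datetime.strptime(cleaned, "%d-%m-%Y")

print(clean_review_date("Updated – 01-08-2010").date())  # 2010-08-01
```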
18. Experimental Setup
Feature Sets: FS1, FS2, FS3, FS4, FS5
Datasets: Restaurant and Hotel reviews
Feature sets drawn from Zhang et al. 2016 and Mukherjee et al. 2013
Classifiers:
1) Random Forest (RF) [Zhang et al. 2016]
2) Support Vector Machine (SVM) [Yuming Lin et al. 2014, C. Lai et al. 2010, Fangtao Li et al. 2011, Arjun Mukherjee et al. 2013, H. Li et al. 2014, Istiaq Ahsan et al. 2016, Zhang et al. 2016]
Evaluation: 10-fold Cross Validation; Precision, Recall, F1 and Accuracy (see the sketch below)
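A minimal sketch of this evaluation setup using scikit-learn; the feature matrix X and labels y are random placeholders standing in for one of the feature sets FS1-FS5:

```python
# Sketch of the evaluation: 10-fold CV with SVM and RF, reporting
# precision, recall, F1 and accuracy; X and y are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X = np.random.rand(200, 6)        # placeholder feature matrix (6 features)
y = np.random.randint(0, 2, 200)  # placeholder fake(1)/genuine(0) labels

scoring = ["precision", "recall", "f1", "accuracy"]
for name, clf in [("SVM", SVC(kernel="linear")),
                  ("RF", RandomForestClassifier(n_estimators=100))]:
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})
```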
19. Feature Set (FS3) [Arjun Mukherjee et al. 2013]
Behavioral:
1) Reviewer Deviation
2) Positive Ratio
3) Maximum Number of Reviews
Contextual:
4) Content Length
5) N-grams
6) Reviewer Content Similarity
24. Feature Set Comparison
[Chart: Precision (P), Recall (R), F1 and Accuracy (A) of SVM and RF with feature sets FS1, FS2 and FS3 on datasets D1 and D2; y-axis from 69 to 94]
29. Results on Hotel Reviews
[Figure: Importance Score of Features on Hotel Reviews]
30. Achievements
Title: "Exploring Behavioral Features with Contextual Feature to Identify Fake Reviews"
Conference: The 23rd Conference on Natural Language & Information Systems (NLDB 2018), 13th-15th June 2018, Paris, France
Status: ACCEPTED
31. Conclusion & Future Work
The behavioral feature "Reviewer Deviation" improves the overall accuracy
Scaling up the dataset can further increase the effect of behavioral features
The BM25 term weighting scheme also improves the classification results
Spammer and spammer-group detection can be explored with a variety of features
Deep learning approaches can also be adopted
32. References
1. Heydari, A., Tavakoli, M. A., Salim, N., & Heydari, Z. (2015). Detection of review spam: A survey. Expert Systems with Applications.
2. Jindal, N., & Liu, B. (2007). Analyzing and detecting review spam. Proceedings - IEEE International Conference on Data Mining, ICDM, 547-552.
3. Algur, S., Hiremath, E., Patil, A., & Shivashankar, S. (2010). Spam detection of customer reviews from web pages. In Proceedings of the 2nd International Conference on IT and Business Intelligence (pp. 1-13).
4. Algur, S. P., Patil, A. P., Hiremath, P. S., & Shivashankar, S. (2010). Conceptual level similarity measure based review spam detection. Signal and Image Processing (ICSIP), 2010 International Conference on, 416-423.
5. Istiaq Ahsan, M., Nahian, T., All Kafi, A., Ismail Hossain, M., & Muhammad Shah, F. (2016). An ensemble approach to detect review spam using hybrid machine learning technique. Computer and Information Technology (ICCIT), 19th International Conference on, IEEE, 388-394.
33. References
6. Jindal, N., & Liu, B. (2008). Opinion spam and analysis. Proceedings of the International Conference on Web Search and Web Data Mining 2008, 219-230.
7. Lai, C. L., Xu, K. Q., Lau, R. Y. K., Li, Y., & Jing, L. (2010). Toward a language modeling approach for consumer review spam detection. Proceedings - IEEE International Conference on E-Business Engineering, ICEBE 2010, 1-8.
8. Li, F., Huang, M., Yang, Y., & Zhu, X. (2011). Learning to identify review spam. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence (Vol. 22, p. 2488).
9. Zhang, D., Zhou, L., Kehoe, J. L., & Kilic, I. Y. (2016). What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. Journal of Management Information Systems, 33(2), 456-481.
10. Mukherjee, A., Venkataraman, V., Liu, B., & Glance, N. (2013). What Yelp fake review filter might be doing? Seventh International AAAI Conference on Weblogs and Social Media, 409-418.
37. Average Posting Rate
$APR(a) = \dfrac{N_r(a)}{N(\text{posting days})}$
The ratio of a reviewer's total number of reviews to the number of the reviewer's active days. An active day is one on which the reviewer posted at least one review.
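A minimal sketch of this feature, assuming a reviewer's reviews are given as (date, rating, text) tuples:

```python
# Sketch of Average Posting Rate; the (date, rating, text) tuple layout is
# an assumed representation of one reviewer's reviews.
from datetime import date

def average_posting_rate(reviews):
    active_days = {d for d, _, _ in reviews}  # days with at least one review
    return len(reviews) / len(active_days)

reviews = [(date(2010, 8, 1), 5, "Great!"),
           (date(2010, 8, 1), 4, "Nice."),
           (date(2010, 8, 3), 1, "Bad.")]
print(average_posting_rate(reviews))  # 3 reviews / 2 active days = 1.5
```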
38. Positive Ratio
$R_{pos}(a) = \dfrac{N_r(\{r_a \mid rating_r \ge 4\})}{N_r(a)}$
The number of a reviewer's reviews with a rating of 4 or higher, divided by the reviewer's total number of reviews.
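A minimal sketch, assuming the reviewer's star ratings are given as a plain list:

```python
# Sketch of Positive Ratio over an assumed list of star ratings (1-5).
def positive_ratio(ratings):
    return sum(r >= 4 for r in ratings) / len(ratings)

print(positive_ratio([5, 4, 1]))  # 2 positive out of 3 reviews -> 0.67
```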
39. Positive-to-Negative Ratio
$R_{pn}(a) = \dfrac{N_r(\{r_a \mid rating_r \ge 4\})}{N_r(\{r_a \mid rating_r \le 2\})}$
The ratio of a reviewer's reviews with a rating of 4 or higher to the reviews with a rating of 2 or lower.
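A minimal sketch; the handling of reviewers with no negative reviews is an assumption, since the slide does not define the zero-denominator case:

```python
# Sketch of Positive-to-Negative Ratio; the zero-denominator fallback is assumed.
def positive_to_negative_ratio(ratings):
    pos = sum(r >= 4 for r in ratings)
    neg = sum(r <= 2 for r in ratings)
    return pos / neg if neg else float("inf")  # assumed handling

print(positive_to_negative_ratio([5, 4, 1]))  # 2 positive / 1 negative = 2.0
```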
40. Review Duration
$RD(a) = D_l(a) - D_f(a)$
The difference between the dates of a reviewer's last and first posted reviews.
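A minimal sketch, assuming posting dates are available as datetime.date values:

```python
# Sketch of Review Duration in days over a reviewer's posting dates.
from datetime import date

def review_duration(dates):
    return (max(dates) - min(dates)).days

print(review_duration([date(2010, 8, 1), date(2010, 9, 15)]))  # 45
```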
41. Reviewer Deviation
$RevDev(r) = \left|\, rating_r - \dfrac{\sum_{r' \in p} rating_{r'}}{N_r(p)} \,\right|$
Captures the variation of a review's rating on a restaurant: the absolute difference between the review's rating and the average of all ratings on that restaurant p.
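A minimal sketch of this feature as read from the formula above (absolute difference from the restaurant's average rating):

```python
# Sketch of Reviewer Deviation: |review rating - restaurant's average rating|.
def reviewer_deviation(review_rating, restaurant_ratings):
    avg = sum(restaurant_ratings) / len(restaurant_ratings)
    return abs(review_rating - avg)

print(reviewer_deviation(5, [1, 2, 2, 3]))  # |5 - 2.0| = 3.0
```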
42. Reviewer Content Similarity
$RCS(a) = \dfrac{\sum_{i=1}^{n} \max_{j \ne i}\, similarity(r_i, r_j)}{n}$
The average text similarity over all n reviews posted by a reviewer, where each review is compared against its most similar other review.
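A minimal sketch using TF-IDF weighted cosine similarity; RQ3 compares different weighting schemes (e.g. BM25), so TF-IDF here is only one illustrative choice:

```python
# Sketch of Reviewer Content Similarity with TF-IDF cosine similarity;
# other weighting schemes (e.g. BM25) can be swapped in.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def reviewer_content_similarity(reviews):
    tfidf = TfidfVectorizer().fit_transform(reviews)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, -1.0)    # exclude each review's self-similarity
    return sim.max(axis=1).mean()  # average best-match similarity

reviews = ["Great food and staff", "Great food, great staff!", "Terrible wait"]
print(reviewer_content_similarity(reviews))
```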
44. Jindal et al. 2007-2008
Discovered spamming activities, including identification of duplicate or near-duplicate reviews using the shingle method (see the sketch below).
Brand reviews and non-reviews were identified using the dissimilarity between product metadata and review content.
Spammer groups were identified by calculating the content similarity between reviews of different reviewers.
Reported 78% AUC.
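A minimal sketch of w-shingling with Jaccard similarity for near-duplicate detection; the shingle width and the decision threshold are illustrative assumptions:

```python
# Sketch of the shingle method: compare word w-shingles of two reviews with
# Jaccard similarity; w=3 and the threshold are assumed for illustration.
def shingles(text, w=3):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

r1 = "this product is great and works perfectly fine"
r2 = "this product is great and works really well"
print(jaccard(shingles(r1), shingles(r2)) > 0.3)  # True -> near duplicate
```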
45. Algur et al. 2010
Two annotators were hired for a dataset containing 960 reviews.
Identified duplicate and near-duplicate reviews using the Hamming distance.
Reported 57% accuracy.
46. Lai, Xu, Lau, Li, & Jing 2010
Identified untruthful reviews and non-reviews.
The feature set for identifying non-reviews included lexical, syntactic and stylistic features.
Two annotators were hired.
SVM achieved 96% recall in classifying non-reviews.
Three types of contextual features were used to identify untruthful reviews.
47. Istiaq Ahsan et al. 2016
Used an unlabeled dataset of Yelp reviews together with the labeled dataset of (Ott, Cardie, & Hancock, 2013).
Duplicate reviews were identified in the unlabeled dataset using the KL-JS distance (see the sketch below).
An accuracy of 88% was reported using NB.
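A minimal sketch of the Jensen-Shannon divergence between the unigram word distributions of two reviews, one reading of the slide's "KL-JS distance"; the tokenization and unigram modeling are illustrative assumptions:

```python
# Sketch of Jensen-Shannon divergence between two reviews' word
# distributions; unigram modeling here is an assumed simplification.
from collections import Counter
from math import log2

def js_divergence(text_a, text_b):
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    p = [ca[w] / sum(ca.values()) for w in vocab]
    q = [cb[w] / sum(cb.values()) for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda x, y: sum(xi * log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return (kl(p, m) + kl(q, m)) / 2  # 0 for identical texts, up to 1

print(js_divergence("great food great staff", "great food great staff"))  # 0.0
```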