FAKE REVIEWS DETECTION USING
SUPERVISED MACHINE LEARNING
ASNA HARIS
S7,CSE A
INDEX
o Introduction
o Background
o Proposed Approach
o Experimental Results
o Conclusion
o References
INTRODUCTION
• This research paper focuses on identifying fake reviews in the context of
E-commerce systems, specifically using a machine learning approach.
1. Importance of Online Reviews: Online reviews are crucial for
building and maintaining a good reputation for E-commerce
businesses.
• They also play a significant role in influencing the decision-making
process of end users.
• Positive reviews are known to attract more customers and increase
sales.
2. Rise of Fake Reviews: The abstract highlights that fake or deceptive
reviews are becoming more common.
• These fake reviews are written intentionally to create a positive image
and attract potential customers.
3. Focus on Identifying Fake Reviews: The research presented in this paper
focuses on the identification of fake reviews.
• This is a challenging problem because it involves not only analyzing the
content of the reviews but also examining the behaviors of the reviewers.
4. Methodology: The paper proposes a machine learning approach to identify
fake reviews.
• In addition to extracting key features from the reviews themselves, the paper
also incorporates features related to the behaviors of the reviewers.
5. Experimental Setup: The research uses a real Yelp dataset consisting of
restaurant reviews for experimentation.
• Various machine learning classifiers are employed, including K-Nearest
Neighbors (KNN), Naive Bayes (NB), Support Vector Machine (SVM), Logistic
Regression, and Random Forest.
• Different language models like bi-gram and tri-gram are also considered
during the evaluations.
6. Results: The results of the experiments show that KNN with K=7
outperforms the other classifiers in terms of f-score, achieving the best
f-score of 82.40%.
• Importantly, the inclusion of reviewers' behavioral features leads to an
improvement in the f-score by 3.80%.
• In summary, the research described in this abstract focuses on the
problem of identifying fake reviews in E-commerce systems using
machine learning.
• It emphasizes the importance of considering both review content and
reviewer behavior, and it presents experimental results that demonstrate
the effectiveness of the proposed approach, with KNN as the best-
performing classifier.
BACKGROUND
• Machine learning is one of the most important technological trends and lies
behind many critical applications.
• There are several types of machine learning algorithms; namely supervised,
semi-supervised, and unsupervised machine learning.
• In the supervised approach, both input and output data are provided and the
training data must be labeled and classified.
• In the unsupervised learning approach, only the input data is given without
any classification or labels and the role of the approach is to find the best fit
clustering or classification of the input data.
• Thus, in unsupervised learning, all data are unlabeled, and the approach’s
role is to label them.
• Finally, in the semi-supervised approach, some data are labeled but most
are unlabeled.
• The main objective of these algorithms is to find a proper model that
discriminates between the classes in the training data.
• For example, Support Vector Machine (SVM) is a discriminative classifier
that separates the given data into classes by finding the best separating
hyperplane for the given training data.
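The separating-hyperplane idea can be sketched in a few lines. This is only an illustration of how a linear decision rule assigns a class once a hyperplane is known; the weights, bias, and the function name `svm_decision` are hypothetical, and a real SVM would learn the hyperplane from labeled training data.

```python
def svm_decision(features, weights, bias):
    """Return +1 or -1 depending on which side of the hyperplane the point lies."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 1 = 0 (assumed, not learned from data).
weights, bias = [1.0, 1.0], -1.0
side_a = svm_decision([2.0, 2.0], weights, bias)
side_b = svm_decision([0.0, 0.0], weights, bias)
print(side_a, side_b)
```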
• Another common supervised learning algorithm is Naive Bayes (NB).
• The key idea of NB relies on Bayes' theorem, which gives the probability of
event A given that event B has occurred: P(A|B) =
P(B|A) * P(A) / P(B).
• NB calculates a set of probabilities by counting the frequencies and
combinations of values in a given dataset.
• NB has been successfully applied in several application domains like text
classification, spam filtering, and recommendation systems.
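As a worked illustration of Bayes' theorem, the sketch below estimates the probability that a review is fake given that it contains a particular word. All counts are made up for illustration and are not taken from the Yelp dataset; the function name `posterior` is likewise an assumption.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical toy counts: 100 reviews, 20 of them fake; the word "amazing"
# appears in 10 fake reviews and 15 real reviews.
total_reviews = 100
fake_reviews = 20
amazing_in_fake = 10
amazing_in_real = 15

p_fake = fake_reviews / total_reviews                            # P(A)
p_amazing_given_fake = amazing_in_fake / fake_reviews            # P(B|A)
p_amazing = (amazing_in_fake + amazing_in_real) / total_reviews  # P(B)

p_fake_given_amazing = posterior(p_amazing_given_fake, p_fake, p_amazing)
print(round(p_fake_given_amazing, 2))
```

A full Naive Bayes classifier would combine such per-word probabilities under the assumption that words are independent given the class.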
• The K-Nearest Neighbors algorithm (KNN) is one of the simplest yet most
powerful classification algorithms.
o KNN has been used mostly in statistical estimation and pattern
recognition.
o The key idea behind KNN is to classify a query instance by a vote among
a group of similar, already classified instances.
o The similarity is usually calculated using a distance function.
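The voting idea behind KNN can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and a simple majority vote; the feature vectors, labels, and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance is one common choice of distance function.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D feature vectors, for illustration only.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["real", "real", "real", "fake", "fake"]
print(knn_predict(points, labels, (0.15, 0.15), k=3))
```

An odd value of K, such as the K=7 reported in this work, avoids tied votes in a two-class problem.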
• Decision tree is another machine learning classifier; it builds a tree
that represents decisions learned from the training instances.
o The algorithm constructs the tree iteratively based on the best
possible split among features.
o The selection of the best features relies on predefined
functions.
• Logistic regression is another simple supervised machine learning
classifier.
o Logistic regression is used to predict a binary
outcome: either something happens or it does not.
o Independent variables are analyzed to determine the binary
outcome, with the results falling into one of two categories.
o The independent variables can be categorical or numeric, but the
dependent variable is always categorical. The model estimates
P(Y=1|X) or P(Y=0|X).
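To make P(Y=1|X) concrete, the sketch below evaluates the standard logistic (sigmoid) function on a weighted sum of the inputs. The weights, bias, and input vector are hypothetical; in practice the parameters are learned from labeled data.

```python
import math

def predict_proba(features, weights, bias):
    """Return P(Y=1|X) under a logistic model with the given parameters."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, for illustration only.
weights = [0.8, -0.5]
bias = 0.1
x = [1.0, 2.0]

p_one = predict_proba(x, weights, bias)   # P(Y=1|X)
p_zero = 1.0 - p_one                      # P(Y=0|X)
label = 1 if p_one >= 0.5 else 0          # assumed decision threshold of 0.5
print(p_one, p_zero, label)
```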
• Random Forest is a successful method that handles the overfitting
problems that occur in decision trees.
o The key idea of Random Forest is to construct a bag of trees
from different samples of the dataset.
o Instead of constructing each tree from all features, Random Forest
selects a small random subset of features while constructing
each tree in the forest.
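The two sources of randomness described above can be sketched without training any trees. The snippet draws a bootstrap sample of rows and a random feature subset for each tree; the helper names, sizes, and seed are assumptions made for illustration. Many implementations draw the feature subset at each split rather than once per tree.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one 'bag' for one tree)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, subset_size, rng):
    """Pick a random subset of feature indices without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))

rng = random.Random(0)     # fixed seed so the sketch is repeatable
rows = list(range(10))     # stand-in for 10 training instances
bags = []
for tree_id in range(3):
    bag = bootstrap_sample(rows, rng)
    features = random_feature_subset(n_features=8, subset_size=3, rng=rng)
    bags.append((bag, features))
    print(tree_id, bag, features)
```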
PROPOSED APPROACH
The proposed approach consists of three basic phases in order to get the best
model for fake reviews detection. These phases are explained in
the following:
• Data Preprocessing
• Feature Extraction
• Feature Engineering
1. Data Preprocessing:
• The first step in the proposed approach is data preprocessing, which is one of the
essential steps in machine learning approaches.
• Data preprocessing is a critical activity because real-world data is rarely
fit to be used as-is.
• A sequence of preprocessing steps has been used in this work to prepare
the raw data of the Yelp dataset for computational activities. These steps can be
summarized as follows:
1) Tokenization: Tokenization is one of the most common natural language
processing techniques. It is a basic step before applying any other
preprocessing techniques.
• The text is divided into individual words called tokens.
• For example, if we have a sentence (“wearing helmets is a must for pedal
cyclists”), tokenization will divide it into the following tokens (“wearing” ,
“helmets” , “is” , “a”, “must”, “for” , “pedal” , “cyclists”)
2) Stop Words Cleaning: Stop words are the most frequently used words,
yet they carry little meaning. Common examples of stop words are (an, a, the, this).
o In this paper, all data are cleaned of stop words before going forward in
the fake review detection process.
3) Lemmatization: Lemmatization is used to convert a word to its base form,
for example a plural to its singular.
o It aims to remove inflectional endings only and to return the base or
dictionary form of the word. For example: converting the word (“plays”) to
(“play”).
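The three preprocessing steps can be sketched end to end in plain Python. This is a toy illustration, not the pipeline used in the paper: the stop-word list is deliberately tiny, lowercasing is an added assumption, and `naive_lemmatize` is a hypothetical one-rule stand-in for a real dictionary-based lemmatizer.

```python
import re

# Small illustrative stop-word list; real pipelines use a much longer one.
STOP_WORDS = {"a", "an", "the", "this", "is", "for"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_lemmatize(token):
    """Toy rule: strip a trailing 's'. Not a real lemmatizer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then lemmatize each remaining token."""
    tokens = remove_stop_words(tokenize(text))
    return [naive_lemmatize(t) for t in tokens]

sentence = "wearing helmets is a must for pedal cyclists"
print(tokenize(sentence))
print(preprocess(sentence))
```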
2. Feature Extraction:
• Feature extraction is a step that aims to increase the performance of either a
pattern recognition or a machine learning system.
• Feature extraction reduces the data to its important
features, which feeds machine and deep learning models with more
valuable data.
• It is mainly a procedure of removing unneeded attributes from the data that
may reduce the accuracy of the model.
• Several approaches have been developed in the literature to extract features for
fake reviews detection.
• Textual features are one popular approach. They include sentiment classification,
which depends on the percentage of positive and negative words in the
review, e.g. “good”, “weak”.
• A second approach uses the personal profiles and behavioral features of users.
• These features give two ways to identify spammers: checking whether the
time-stamps of a user's comments are more frequent and unique than those of normal
users, or whether the user posts redundant reviews that have no relation to the domain
of the target.
• In this paper, we apply TF-IDF to extract the features of the contents in two
language models; mainly bi-gram and tri-gram.
• In both language models, we also use the extended dataset obtained after extracting
the features representing the user behaviors.
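TF-IDF over bi-grams and tri-grams can be sketched in plain Python. The slides do not state the exact weighting formula, so the code assumes one textbook variant (relative term frequency times log(N / document frequency)); library implementations such as scikit-learn use smoothed variants by default. The sample documents and function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(documents, n):
    """Compute TF-IDF weights over n-grams for a list of tokenized documents."""
    grams_per_doc = [Counter(ngrams(doc, n)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(g for counts in grams_per_doc for g in counts)
    n_docs = len(documents)
    weights = []
    for counts in grams_per_doc:
        total = sum(counts.values())
        weights.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weights

# Hypothetical tokenized reviews, for illustration only.
docs = [
    ["great", "food", "great", "service"],
    ["great", "food", "slow", "service"],
]
bigram_weights = tfidf(docs, n=2)
trigram_weights = tfidf(docs, n=3)
print(bigram_weights[0])
```

An n-gram that occurs in every document, such as ("great", "food") here, gets weight zero under this variant, while n-grams specific to one review get a positive weight.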
3. Feature Engineering:
• Fake reviews often exhibit distinct characteristics related to the
behaviors of the reviewers while composing their reviews.
• In this paper, we delve into some of these features and explore how
they impact the efficacy of fake review detection.
• Specifically, we analyze the following behavioral features: caps-count,
punct-count, and emojis.
1. Caps-Count: This feature quantifies the total number of capital letters
employed by a reviewer when crafting their review.
2. Punct-Count: It measures the total count of punctuation marks found in each
review.
3. Emojis: This feature tallies the total number of emojis included in each review.
Additionally, we apply statistical analysis to assess reviewers' behaviors.
• We utilize the "groupby" function, which allows us to determine the number of
fake and real reviews authored by each reviewer on specific dates and at
particular hotels.
• All of these features are integrated into our analysis to investigate how user
behaviors impact the performance of our classifiers.
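The three behavioral features and the per-reviewer counts can be sketched as below. The emoji ranges are a rough assumption covering common emoji blocks rather than a complete definition, the sample review and rows are hypothetical, and `collections.Counter` stands in for the "groupby" function mentioned above.

```python
import string
from collections import Counter

def caps_count(text):
    """Number of uppercase letters in the review."""
    return sum(1 for ch in text if ch.isupper())

def punct_count(text):
    """Number of ASCII punctuation marks in the review."""
    return sum(1 for ch in text if ch in string.punctuation)

def emoji_count(text):
    """Rough emoji count based on a few common Unicode ranges (assumption)."""
    ranges = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]
    return sum(1 for ch in text if any(lo <= ord(ch) <= hi for lo, hi in ranges))

review = "BEST place EVER!!! Loved it \U0001F60D\U0001F60D"
print(caps_count(review), punct_count(review), emoji_count(review))

# Stand-in for a "groupby" count: number of reviews per
# (reviewer ID, date, hotel ID, label). The rows are hypothetical.
rows = [
    ("u1", "2012-01-05", "h7", "fake"),
    ("u1", "2012-01-05", "h7", "fake"),
    ("u2", "2012-01-05", "h7", "real"),
]
reviews_per_group = Counter(rows)
print(reviews_per_group[("u1", "2012-01-05", "h7", "fake")])
```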
EXPERIMENTAL RESULTS
• We evaluated our proposed system on the Yelp dataset.
• This dataset includes 5,853 reviews of 201 hotels in Chicago written by 38,063
reviewers.
• The reviews are classified into 4,709 reviews labeled as real and 1,144 reviews
labeled as fake.
• Yelp has classified the reviews into genuine and fake.
• Each instance of the review in the dataset contains the review date, review ID,
reviewer ID, product ID, review label, and star rating.
The statistics of the dataset are summarized in Table I.
• In addition to the dataset and its statistics, we extracted other features
representing the behaviors of reviewers while writing their reviews.
• These features include caps-count, punct-count, and emojis, the last of which
counts the total number of emojis in each review.
• We take all these features into consideration to see the effect of the
users' behaviors on the performance of the classifiers.
• We present the results of several experiments and their evaluation using five
different machine learning classifiers.
• We first apply TF-IDF to extract the features of the contents in two language
models; mainly bi-gram and tri-gram.
• In both language models, we also use the extended dataset obtained after extracting
the features representing the users' behaviors mentioned in the last section.
• Since the dataset is unbalanced in terms of positive and negative labels, we take
into consideration the precision and the recall. Hence, f1-score is considered as a
performance measure in addition to accuracy.
• 70% of the dataset is used for training while 30% is used for testing.
• The classifiers are first evaluated without the extracted behavioral features
of users and then with them.
• In each case, we compare the performance of the classifiers in the bi-gram and tri-gram
language models.
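The evaluation protocol can be sketched as a 70/30 split followed by precision, recall, and f1-score. The slides do not say which class is treated as positive or how the split was randomized, so the `positive="fake"` default, the seed, and the sample labels are all assumptions for illustration.

```python
import random

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def precision_recall_f1(y_true, y_pred, positive="fake"):
    """Precision, recall and f1-score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

train, test = train_test_split(range(10))
print(len(train), len(test))

# Hypothetical labels and predictions, for illustration only.
y_true = ["fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "fake", "real", "real"]
print(precision_recall_f1(y_true, y_pred))
```

Because f1-score is the harmonic mean of precision and recall, a classifier that labels every review as real can still reach high accuracy on an unbalanced dataset while scoring an f1 of zero on the fake class.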
• Table II summarizes the accuracy results in the absence of the extracted behavioral
features of users in the two language models.
• The average accuracy of each classifier
over the two language models is shown.
• The logistic regression
classifier gives the highest accuracy
of 87.87% in the bi-gram model.
• SVM and Random Forest have accuracy relatively close to that of logistic
regression.
• In the tri-gram model, KNN and logistic regression are the best, with an accuracy of
87.87%. SVM and Random Forest are relatively close, with a score of
87.82%.
• The highest average accuracy is achieved by logistic regression,
with 87.87%.
• On the other hand, Table III summarizes the
accuracy of the classifiers in the presence of
the extracted features. The results reveal
that the classifier that gives the highest accuracy in
the bi-gram model is SVM, with a score of 86.9%.
• Logistic regression and Random Forest have
relatively close accuracy, with scores of 86.89% and
86.85% respectively.
• In the tri-gram model, both SVM and logistic regression give
the best accuracy, with a score of 86.9%.
• Random Forest gives a close score of 86.8%.
• Also, the highest average accuracy is obtained with the SVM classifier,
with a score of 86.9%.
• Similarly, Table IV presents
the recall, precision, and hence f1-score in the
absence of the extracted behavioral features of
the users in the two language models.
• For the trade-off between recall and precision,
f1-score is taken as the
evaluation criterion of each classifier.
• In the bi-gram model, KNN (K=7) outperforms all other
classifiers with an f1-score of 82.40%.
• In the tri-gram model, both logistic regression and KNN (K=7)
outperform the other classifiers with an f1-score of 82.20%.
• Overall, KNN outperforms the other classifiers with an
average f1-score of 82.30%.
• Table V summarizes the recall, precision,
and f1-score in the presence of the extracted
behavioral features of the users in the two language
models. The highest f1-score
in the bi-gram model is achieved by logistic regression, with a
value of 82%.
• The highest f1-score in the tri-gram model is achieved by KNN, with a
value of 86.20%.
• The KNN classifier outperforms all classifiers in terms of the overall
average f1-score, with a value of 83.73%.
• The results reveal that KNN (K=7) outperforms the rest of the classifiers in
terms of f-score, with the best f-score of 82.40%.
• The result is raised by 3.80% when taking the extracted features into
consideration, giving the best f-score value of 86.20%.
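The reported averages and gain can be checked with simple arithmetic. The sketch below uses only figures quoted in the bullets above; it suggests that the 3.80% improvement is a difference of 3.80 percentage points between the two best f-scores, which is an interpretation rather than something the slides state explicitly.

```python
# F1-scores (in %) as reported in the text above.
knn_bigram_f1 = 82.40          # without behavioral features, bi-gram
knn_trigram_f1 = 82.20         # without behavioral features, tri-gram
best_f1_with_features = 86.20  # with behavioral features, tri-gram KNN

# Average of the two language models without behavioral features.
average_f1 = round((knn_bigram_f1 + knn_trigram_f1) / 2, 2)
# Gain of the best score with features over the best score without them.
gain = round(best_f1_with_features - knn_bigram_f1, 2)
print(average_f1, gain)
```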
CONCLUSION
• Reviews clearly play a crucial role in people's decisions.
• Thus, fake reviews detection is an active and ongoing research area.
• In this paper, a machine learning fake reviews detection approach is
presented.
• In the proposed approach, both the features of the reviews and the
behavioral features of the reviewers are considered.
• The Yelp dataset is used to evaluate the proposed approach.
• Different classifiers are implemented in the developed approach.
• The bi-gram and tri-gram language models are used and compared in the
developed approach.
• The results reveal that the KNN classifier (with K=7) outperforms the rest of the
classifiers in the fake reviews detection process.
• Also, the results show that considering the behavioral features of the
reviewers increases the f-score by 3.80%.
• Future work may consider including other behavioral features, such as
how often reviewers post reviews, the
time reviewers take to complete reviews, and how frequently they
submit positive or negative reviews.
• It is highly expected that considering more behavioral features will enhance
the performance of the presented fake reviews detection approach.
REFERENCES
[1] R. Barbado, O. Araque, and C. A. Iglesias, “A framework for fake review
detection in online consumer electronics retailers,” Information Processing &
Management, vol. 56, no. 4, pp. 1234 – 1244, 2019.
[2] S. Tadelis, “The economics of reputation and feedback systems in
ecommerce marketplaces,” IEEE Internet Computing, vol. 20, no. 1, pp. 12–19,
2016.
[3] M. J. H. Mughal, “Data mining: Web data mining techniques, tools and
algorithms: An overview,” Information Retrieval, vol. 9, no. 6, 2018.
[4] C. C. Aggarwal, “Opinion mining and sentiment analysis,” in Machine
Learning for Text. Springer, 2018, pp. 413–434.
[5] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, “What yelp fake
review filter might be doing?” in Seventh international AAAI conference on
weblogs and social media, 2013.
[6] N. Jindal and B. Liu, “Review spam detection,” in Proceedings of the 16th
International Conference on World Wide Web, ser. WWW ’07, 2007.
[7] E. Elmurngi and A. Gherbi, Detecting Fake Reviews through Sentiment
Analysis Using Machine Learning Techniques. IARIA/DATA ANALYTICS, 2017.
QUESTIONS ????
THANK YOU!

More Related Content

Similar to seminar.pptx

لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)SwatiTripathi44
 
CYBERBULLYING DETECTION USING MACHINE LEARNING-1 (1).pdf
CYBERBULLYING DETECTION USING              MACHINE LEARNING-1 (1).pdfCYBERBULLYING DETECTION USING              MACHINE LEARNING-1 (1).pdf
CYBERBULLYING DETECTION USING MACHINE LEARNING-1 (1).pdfKumbidiGaming
 
Supervised learning techniques and applications
Supervised learning techniques and applicationsSupervised learning techniques and applications
Supervised learning techniques and applicationsBenjaminlapid1
 
IRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review AnalysisIRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review AnalysisIRJET Journal
 
Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...IJTET Journal
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptxnarmeen11
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Reinventing Auditing with Machine Learning
Reinventing Auditing with Machine LearningReinventing Auditing with Machine Learning
Reinventing Auditing with Machine LearningAndrew Clark
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsIRJET Journal
 
It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!To Sum It Up
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxdaniahendric
 
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining Approach
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining ApproachIRJET- Sentiment Analysis: Algorithmic and Opinion Mining Approach
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining ApproachIRJET Journal
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptxNaveenkushwaha18
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selectionDavis David
 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptxhiblooms
 
Introduction to Machine Learning.pptx
Introduction to Machine Learning.pptxIntroduction to Machine Learning.pptx
Introduction to Machine Learning.pptxDr. Amanpreet Kaur
 

Similar to seminar.pptx (20)

Internship Presentation.pdf
Internship Presentation.pdfInternship Presentation.pdf
Internship Presentation.pdf
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)
 
CYBERBULLYING DETECTION USING MACHINE LEARNING-1 (1).pdf
CYBERBULLYING DETECTION USING              MACHINE LEARNING-1 (1).pdfCYBERBULLYING DETECTION USING              MACHINE LEARNING-1 (1).pdf
CYBERBULLYING DETECTION USING MACHINE LEARNING-1 (1).pdf
 
Supervised learning techniques and applications
Supervised learning techniques and applicationsSupervised learning techniques and applications
Supervised learning techniques and applications
 
IRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review AnalysisIRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review Analysis
 
Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...Optimization Technique for Feature Selection and Classification Using Support...
Optimization Technique for Feature Selection and Classification Using Support...
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Reinventing Auditing with Machine Learning
Reinventing Auditing with Machine LearningReinventing Auditing with Machine Learning
Reinventing Auditing with Machine Learning
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection Methods
 
It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!It's Machine Learning Basics -- For You!
It's Machine Learning Basics -- For You!
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docx
 
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining Approach
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining ApproachIRJET- Sentiment Analysis: Algorithmic and Opinion Mining Approach
IRJET- Sentiment Analysis: Algorithmic and Opinion Mining Approach
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptx
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selection
 
Research proposal
Research proposalResearch proposal
Research proposal
 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptx
 
Introduction to Machine Learning.pptx
Introduction to Machine Learning.pptxIntroduction to Machine Learning.pptx
Introduction to Machine Learning.pptx
 
Rapid Miner
Rapid MinerRapid Miner
Rapid Miner
 

Recently uploaded

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiessarkmank1
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptAfnanAhmad53
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Ramkumar k
 
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...gragchanchal546
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...ppkakm
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...drmkjayanthikannan
 

Recently uploaded (20)

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .ppt
 
School management system project Report.pdf

seminar.pptx

  • 1. FAKE REVIEWS DETECTION USING SUPERVISED MACHINE LEARNING ASNA HARIS S7,CSE A
  • 2. INDEX o Introduction o Background o Proposed Approach o Experimental Results o Conclusion o References
  • 3. INTRODUCTION • This research paper focuses on identifying fake reviews in the context of E-commerce systems, specifically using a machine learning approach. 1. Importance of Online Reviews: Online reviews are crucial for building and maintaining a good reputation for E-commerce businesses. • They also play a significant role in influencing the decision-making process of end users. • Positive reviews are known to attract more customers and increase sales. 2. Rise of Fake Reviews: The abstract highlights that fake or deceptive reviews are becoming more common. • These fake reviews are written intentionally to create a positive image and attract potential customers.
  • 4. 3. Focus on Identifying Fake Reviews: The research presented in this paper focuses on the identification of fake reviews. • This is a challenging problem because it involves not only analyzing the content of the reviews but also examining the behaviors of the reviewers. 4. Methodology: The paper proposes a machine learning approach to identify fake reviews. • In addition to extracting key features from the reviews themselves, the paper also incorporates features related to the behaviors of the reviewers. 5. Experimental Setup: The research uses a real Yelp dataset consisting of restaurant reviews for experimentation. • Various machine learning classifiers are employed, including K-Nearest Neighbors (KNN), Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression, and Random Forest. • Different language models like bi-gram and tri-gram are also considered during the evaluations.
  • 5. 6. Results: The results of the experiments show that KNN with K=7 outperforms the other classifiers in terms of f-score, achieving the best f-score of 82.40%. • Importantly, the inclusion of reviewers' behavioral features leads to an improvement in the f-score by 3.80%. • In summary, the research described in this abstract focuses on the problem of identifying fake reviews in E-commerce systems using machine learning. • It emphasizes the importance of considering both review content and reviewer behavior, and it presents experimental results that demonstrate the effectiveness of the proposed approach, with KNN as the best-performing classifier.
  • 6. BACKGROUND • Machine learning is one of the most important technological trends and lies behind many critical applications. • There are several types of machine learning algorithms, namely supervised, semi-supervised, and unsupervised machine learning. • In the supervised approach, both input and output data are provided, and the training data must be labeled and classified. • In the unsupervised learning approach, only the input data is given, without any classification or labels, and the role of the approach is to find the best-fitting clustering or classification of the input data. • Thus, in unsupervised learning, all data are unlabeled, and the approach’s role is to label them.
  • 7. • Finally, in the semi-supervised approach, some data are labeled but most are unlabeled. • The main objective of these algorithms is to find a proper model that discriminates between the classes in the training data. • For example, Support Vector Machine (SVM) is a discriminative classifier that separates the given data into classes by finding the best separating hyper-plane for the given training data. • Another common supervised learning algorithm is Naive Bayes (NB). • The key idea of NB relies on Bayes' theorem: the probability of event A given that event B has occurred, which is written as P(A|B) = P(B|A) * P(A) / P(B). • NB calculates a set of probabilities by counting the frequency and the combinations of values in a given dataset. • NB has been successfully applied in several application domains like text classification, spam filtering, and recommendation systems.
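As a rough illustration of the Bayes rule on this slide, the sketch below estimates the probability that a review is fake given that it contains a particular word, using plain frequency counts. The counts and the word are invented for illustration and do not come from the Yelp dataset; the helper name `posterior` is also hypothetical.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Hypothetical counts: 10 reviews, 4 of them fake.
# The word "amazing" appears in 5 reviews, 3 of which are fake.
n_reviews, n_fake = 10, 4
n_word_total, n_word_in_fake = 5, 3

p_fake = n_fake / n_reviews                   # P(A): prior probability of a fake review
p_word = n_word_total / n_reviews             # P(B): probability the word appears
p_word_given_fake = n_word_in_fake / n_fake   # P(B|A): word frequency among fake reviews

# P(A|B): probability a review is fake given that it contains the word
p_fake_given_word = posterior(p_word_given_fake, p_fake, p_word)
print(round(p_fake_given_word, 3))
```

With these made-up counts the posterior comes out at roughly 0.6, which matches the direct count of 3 fake reviews among the 5 that contain the word.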
  • 8. • The K-Nearest Neighbors algorithm (or KNN) is one of the simplest yet most powerful classification algorithms. • KNN has been used mostly in statistical estimation and pattern recognition. • The key idea behind KNN is to classify a query instance by a vote among a group of similar, already classified instances. • The similarity is usually calculated using a distance function. • Decision tree is another machine learning classifier that relies on building a tree that represents decisions over the training instances. • The algorithm constructs the tree iteratively based on the best possible split among features. • The selection of the best features relies on predefined functions.
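The voting idea behind KNN can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and a simple majority vote; the toy 2-D points and the function name `knn_predict` are hypothetical, and the default `k=7` only echoes the value reported in the slides.

```python
import math
from collections import Counter


def knn_predict(train, query, k=7):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors (hypothetical, for illustration only)
train = [
    ((0.0, 0.0), "real"), ((0.1, 0.2), "real"), ((0.2, 0.1), "real"),
    ((1.0, 1.0), "fake"), ((0.9, 1.1), "fake"), ((1.1, 0.9), "fake"),
]

prediction = knn_predict(train, (0.95, 1.0), k=3)
print(prediction)
```

An odd `k` is a common choice for two-class problems because it avoids tied votes.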
  • 9. • Logistic regression is another simple supervised machine learning classifier. • Logistic regression is a calculation used to predict a binary outcome: either something happens or it does not. • Independent variables are analyzed to determine the binary outcome, with the results falling into one of two categories. • The independent variables can be categorical or numeric, but the dependent variable is always categorical. It is written as P(Y=1|X) or P(Y=0|X). • Random Forest is a successful method that handles the overfitting problems that occur in the decision tree. • The key essence of random forest is to construct a bag of trees from different samples of the dataset. • Instead of constructing each tree from all features, random forest selects a small random subset of features while constructing each tree in the forest.
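To make the notation P(Y=1|X) concrete, here is a minimal sketch of how a fitted logistic regression model turns features into a probability and then into a binary label. The weights, bias, and 0.5 threshold are hypothetical values chosen for illustration; the slides do not give the fitted parameters.

```python
import math


def predict_proba(x, weights, bias):
    """Return P(Y=1 | X=x) under a logistic model: sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters for a two-feature model
weights, bias = [1.5, -2.0], 0.25

p1 = predict_proba([1.0, 0.5], weights, bias)  # P(Y=1|X)
p0 = 1.0 - p1                                  # P(Y=0|X)
label = 1 if p1 >= 0.5 else 0                  # threshold at 0.5 (an assumption)
print(round(p1, 3), round(p0, 3), label)
```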
  • 10. PROPOSED APPROACH The proposed approach consists of three basic phases that lead to the best model for fake review detection. These phases are explained in the following: • Data Preprocessing • Feature Extraction • Feature Engineering
  • 11. 1. Data Preprocessing: • The first step in the proposed approach is data preprocessing, which is one of the essential steps in machine learning approaches. • Data preprocessing is a critical activity because real-world data is never appropriate to be used as it is. • A sequence of preprocessing steps has been used in this work to prepare the raw data of the Yelp dataset for computational activities. This can be summarized as follows: 1) Tokenization: Tokenization is one of the most common natural language processing techniques. It is a basic step before applying any other preprocessing techniques. • The text is divided into individual words called tokens. • For example, if we have the sentence (“wearing helmets is a must for pedal cyclists”), tokenization will divide it into the following tokens (“wearing”, “helmets”, “is”, “a”, “must”, “for”, “pedal”, “cyclists”).
  • 12. 2) Stop Words Cleaning: Stop words are the words that are used the most, yet they carry little information. Common examples of stop words are (an, a, the, this). • In this paper, all data are cleaned from stop words before going forward in the fake review detection process. 3) Lemmatization: The lemmatization method is used to convert the plural form of a word to its singular form. • It aims to remove inflectional endings only and to return the base or dictionary form of the word. For example: converting the word (“plays”) to (“play”).
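The three preprocessing steps (tokenization, stop word cleaning, lemmatization) can be chained as in the sketch below. It uses only the standard library so that it stays self-contained: the stop word list is a tiny illustrative set, and `naive_lemmatize` is a crude plural-stripping stand-in for a dictionary-based lemmatizer, not the method used in the paper. All function names are hypothetical.

```python
import re

# Tiny illustrative stop word list; real lists are much longer (assumption).
STOP_WORDS = {"a", "an", "the", "this", "is", "for"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_lemmatize(token):
    """Crude plural stripping; a stand-in for a dictionary-based lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then lemmatize each remaining token."""
    return [naive_lemmatize(t) for t in remove_stop_words(tokenize(text))]


sentence = "wearing helmets is a must for pedal cyclists"
print(tokenize(sentence))
print(preprocess(sentence))
```

A rule-based stripper like this one mishandles irregular words, which is why a dictionary-based lemmatizer is normally preferred in practice.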
  • 13. 2. Feature Extraction: • Feature extraction is a step that aims to increase the performance of either a pattern recognition or a machine learning system. • Feature extraction represents a reduction of the data to its important features, which results in feeding machine and deep learning models with more valuable data. • It is mainly a procedure of removing unneeded attributes from the data that may reduce the accuracy of the model. • Several approaches have been developed in the literature to extract features for fake review detection. • Textual features are one popular approach. They include sentiment classification, which depends on the percentage of positive and negative words in the review, e.g. “good”, “weak”.
  • 14. • Secondly, there are user personal profiles and behavioral features. • These features offer two ways to identify spammers: checking whether the time-stamps of a user’s comments are more frequent and unusual than those of normal users, or whether the user posts redundant reviews that have no relation to the domain of the target. • In this paper, we apply TF-IDF to extract the features of the contents in two language models, namely bi-gram and tri-gram. • In both language models, we also use the extended dataset obtained after extracting the features representing the user behaviors.
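A minimal sketch of TF-IDF over bi-grams and tri-grams is shown below. It assumes whitespace tokenization, term frequency as a share of the document's n-grams, and an unsmoothed `log(N / df)` inverse document frequency; library implementations often use smoothed variants, so the exact weights will differ. The three toy reviews and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document.

    tf  = count of the n-gram / number of n-grams in the document
    idf = log(number of documents / number of documents containing the n-gram)
    """
    grams_per_doc = [ngrams(doc.lower().split(), n) for doc in docs]
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams) or 1  # guard against documents shorter than n
        vectors.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return vectors


# Toy reviews (hypothetical)
docs = ["the food was great", "the food was awful", "great service great food"]
bigram_vectors = tfidf(docs, n=2)
trigram_vectors = tfidf(docs, n=3)
print(bigram_vectors[0])
```

In the first toy review, the bi-gram "was great" receives a higher weight than "the food" because it occurs in only one of the three documents.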
  • 15. 3. Feature Engineering : • Fake reviews often exhibit distinct characteristics related to the behaviors of the reviewers while composing their reviews. • In this paper, we delve into some of these features and explore how they impact the efficacy of fake review detection. • Specifically, we analyze the following behavioral features: caps-count, punct-count, and emojis.
  • 16. 1. Caps-Count: This feature quantifies the total number of capital letters employed by a reviewer when crafting their review. 2. Punct-Count: It measures the total count of punctuation marks found in each review. 3. Emojis: This feature tallies the total number of emojis included in each review. Additionally, we apply statistical analysis to assess reviewers' behaviors. • We utilize the "groupby" function, which allows us to determine the number of fake and real reviews authored by each reviewer on specific dates and at particular hotels. • All of these features are integrated into our analysis to investigate how user behaviors impact the performance of our classifiers.
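The three behavioral counts and the per-reviewer grouping can be sketched as follows. This is an approximation under stated assumptions: punctuation is limited to ASCII punctuation, emojis are detected with two broad Unicode ranges rather than a complete emoji table, and a `Counter` keyed by (reviewer, date, label) stands in for the "groupby" step mentioned in the slides. The sample records and field names are hypothetical.

```python
import string
from collections import Counter


def caps_count(text):
    """Number of uppercase letters in the review."""
    return sum(1 for ch in text if ch.isupper())


def punct_count(text):
    """Number of ASCII punctuation marks in the review."""
    return sum(1 for ch in text if ch in string.punctuation)


def emoji_count(text):
    """Count characters in a few common emoji code-point ranges (approximation)."""
    ranges = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))
    return sum(1 for ch in text if any(lo <= ord(ch) <= hi for lo, hi in ranges))


# Review text ending with two emojis, written as escapes for portability
review = "GREAT food!!! Loved it \U0001F60D\U0001F44D"
features = {
    "caps_count": caps_count(review),
    "punct_count": punct_count(review),
    "emojis": emoji_count(review),
}
print(features)

# Hypothetical labeled records; counting by (reviewer, date, label)
records = [
    {"reviewer": "u1", "date": "2012-01-01", "label": "fake"},
    {"reviewer": "u1", "date": "2012-01-01", "label": "fake"},
    {"reviewer": "u1", "date": "2012-01-02", "label": "real"},
    {"reviewer": "u2", "date": "2012-01-01", "label": "real"},
]
per_reviewer = Counter((r["reviewer"], r["date"], r["label"]) for r in records)
print(per_reviewer[("u1", "2012-01-01", "fake")])
```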
  • 18. EXPERIMENTAL RESULTS • We evaluated our proposed system on the Yelp dataset. • This dataset includes 5,853 reviews of 201 hotels in Chicago written by 38,063 reviewers. • The reviews are classified into 4,709 reviews labeled as real and 1,144 reviews labeled as fake. • Yelp has classified the reviews into genuine and fake. • Each instance of the review in the dataset contains the review date, review ID, reviewer ID, product ID, review label, and star rating. The statistics of the dataset are summarized in Table I.
  • 19. • In addition to the dataset and its statistics, we extracted other features representing the behaviors of reviewers while writing their reviews. • These features include caps-count, punct-count, and emojis, which counts the total number of emojis in each review. • We will take all these features into consideration to see the effect of the users’ behaviors on the performance of the classifiers.
  • 20. • We present the results of several experiments and their evaluation using five different machine learning classifiers. • We first apply TF-IDF to extract the features of the contents in two language models, namely bi-gram and tri-gram. • In both language models, we also use the extended dataset obtained after extracting the features representing the users’ behaviors mentioned in the last section. • Since the dataset is unbalanced in terms of positive and negative labels, we take into consideration the precision and the recall. Hence, f1-score is considered as a performance measure in addition to accuracy. • 70% of the dataset is used for training while 30% is used for testing. • The classifiers are first evaluated in the absence of the extracted behavioral features of users and then in the presence of the extracted behaviors. • In each case, we compare the performance of the classifiers in the Bi-gram and Tri-gram language models.
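The evaluation setup described here (a 70/30 split and an f1-score built from precision and recall) can be sketched as below. The shuffling seed, the choice of "fake" as the positive class, and the toy labels are assumptions made for illustration; the slides do not state how the split was drawn or how the scores were averaged.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive="fake"):
    """Compute precision, recall and f1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


train, test = train_test_split(list(range(10)))
print(len(train), len(test))

# Toy labels and predictions (hypothetical)
y_true = ["fake", "fake", "real", "real", "real", "fake"]
y_pred = ["fake", "real", "real", "fake", "real", "fake"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because f1 is the harmonic mean of precision and recall, a classifier that simply predicts the majority class on an unbalanced dataset scores well on accuracy but poorly on f1 for the minority class.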
  • 21. • Table II summarizes the accuracy results in the absence of the extracted behavioral features of users in the two language models. • The average accuracy of each classifier over the two language models is shown. • It is found that the logistic regression classifier gives the highest accuracy of 87.87% in the Bi-gram model. • SVM and Random forest classifiers have relatively close accuracy to logistic regression. • In the Tri-gram model, KNN and Logistic regression are the best with an accuracy of 87.87%. SVM and Random forest have relatively close accuracy with a score of 87.82%. • It is found that the highest average accuracy is achieved by logistic regression with 87.87%.
  • 22. • On the other hand, Table III summarizes the accuracy of the classifiers in the presence of the extracted features. The results reveal that the classifier that gives the highest accuracy in Bi-gram is SVM with a score of 86.9%. • Logistic regression and Random forest have relatively close accuracy with scores of 86.89% and 86.85% respectively. • In the Tri-gram model, both SVM and logistic regression give the best accuracy with a score of 86.9%. • The Random forest gives a close score of 86.8%. • Also, it is found that the highest average accuracy is obtained with the SVM classifier with a score of 86.9%.
  • 23. • Similarly, Table IV presents the recall, precision, and hence f1-score in the absence of the extracted behavioral features of the users in the two language models. • For the trade-off between recall and precision, f1-score is taken into account as the evaluation criterion of each classifier. • In Bi-gram, KNN(K=7) outperforms all other classifiers with an f1-score value of 82.40%. • Whereas, in Tri-gram, both logistic regression and KNN(K=7) outperform other classifiers with an f1-score value of 82.20%. • It is found that KNN outperforms all other classifiers overall with an average f1-score of 82.30%.
  • 24. • Table V summarizes the recall, precision, and f1-score in the presence of the extracted behavioral features of the users in the two language models. It is found that the highest f1-score value is achieved by Logistic regression with an f1-score value of 82% in the case of Bi-gram. • The highest f1-score value in Tri-gram is achieved by KNN with an f1-score value of 86.20%. • The KNN classifier outperforms all classifiers in terms of the overall average f1-score with a value of 83.73%. • The results reveal that KNN(K=7) outperforms the rest of the classifiers in terms of f-score, with a best f-score of 82.40%. • This result rises by 3.80% when the extracted features are taken into consideration, giving a best f-score value of 86.20%.
  • 25. CONCLUSION • It is obvious that reviews play a crucial role in people’s decisions. • Thus, fake review detection is an active and ongoing research area. • In this paper, a machine learning approach to fake review detection is presented. • In the proposed approach, both the features of the reviews and the behavioral features of the reviewers are considered. • The Yelp dataset is used to evaluate the proposed approach. • Different classifiers are implemented in the developed approach. • The Bi-gram and Tri-gram language models are used and compared in the developed approach.
  • 26. • The results reveal that the KNN classifier (with K=7) outperforms the rest of the classifiers in the fake review detection process. • Also, the results show that considering the behavioral features of the reviewers increases the f-score by 3.80%. • Future work may consider including other behavioral features, such as how frequently reviewers post reviews, the time reviewers take to complete reviews, and how frequently they submit positive or negative reviews. • It is highly expected that considering more behavioral features will enhance the performance of the presented fake review detection approach.