SlideShare a Scribd company logo
Opinion Spam and Analysis
Nitin Jindal and Bing Liu
(WSDM, 2008)
1
three types of spam reviews
• Type 1 (untruthful opinions): also known as fake reviews or bogus
reviews.
2
deliberately mislead
readers or opinion mining systems
1.To promote some target objects
(hype spam).
2. To damage the reputation of
some other target objects
(defaming spam).
three types of spam reviews
• Type 2 (reviews on brands only): not comment on the products for
the products but only the brands, the manufacturers or the sellers.
• Type 3 (non-reviews): two main sub-types:
• advertisements
• other irrelevant reviews containing no opinions (questions, answers ..)
• They are not targeted at the specific products and are often biased.
3
Review Data from
amazon.com
• Each amazon.com’s review consists of 8 parts
• <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Revie
>
4
1
2
3 45
6
7 8
5
Reviews, Reviewer, and
Products
6
Figure 1. Log-log plot of number of reviews
to number of members for amazon.
Figure 2. Log-log plot of number of reviews
to number of products for amazon.
• There are 2 reviewers with more than 15,000 reviews.
• 68% of reviewers wrote only 1 review.
• 50% of products have only 1 review.
• Only 19% of the products have at least 5 reviews.
Review Ratings and
Feedbacks
7
Figure 3. Log-log plot of number of reviews
to number of feedbacks for amazon.
Figure 4. Rating vs. percent of reviews
• 60% of the reviews have a rating of 5.0.
• On average, a review gets 7 feedbacks.
• The percentage of positive feedbacks of a review decreases rapidly
from the first review of a product to the last.
Spam Detection
• a classification problem with two classes, spam and non-spam.
• manually label training examples for spam reviews of type 2
and type 3.
• recognizing whether a review is an untruthful opinion spam
(type 1) is extremely difficult by manually reading the review.
• found a large number of duplicate and near-duplicate reviews.
• 1. Duplicates from different userids on the same product.
• 2. Duplicates from the same userid on different products.
• 3. Duplicates from different userids on different products.
8
Detection of Duplicate Reviews
• shingle method (2-grams)
• Jaccard distance (Similarity score) > 90%  duplicates.
• The maximum similarity score : the maximum of similarity
scores between different reviews of a reviewer.
9
Detecting Type 2 & Type 3
• Model Building Using Logistic Regression
• Manually labeled 40 spam reviews of the two types.
• It produces a probability estimate of each review being a spam.
10
Feature Identification and
Construction
• There are three main types of information
• (1) the content of the review,
• (2) the reviewer who wrote the review,
• (3) the product being reviewed.
• three types of features:
• (1) review centric features,
• (2) reviewer centric features,
• (3) product centric features.
• three types based on their average ratings
• Good (rating ≥ 4), bad (rating ≤ 2.5) and Average, otherwise
11
Review Centric Features
• number of
feedbacks(F1)
• number of helpful
feedbacks(F2)
• percent of helpful
feedbacks(F3)
• length of the review
title(F4)
• length of review
body(F5)
• Position of the
review of a product
sorted by date.
ascending (F6) and
descending (F7)
• the first review (F8)
• the only review (F9)
12
Review Centric Features -Textual
features• percent of positive bearing
words(F10)
• percent of negative bearing
words(F11)
• cosine similarity (F12)
• percent of times brand
name (F13)
• percent of numerals
words(F14)
• percent of capitals
words(F15)
• percent of all capital
words(F16)
13
• rating of the review (F17)
• the deviation from product rating (F18)
• the review is good, average or bad (F19)
• a bad review was written just after the first good review of
the product and vice versa (F20, F21)
Review Centric Features –Rating related features
Reviewer Centric Features
• Ratio of the number of reviews that the reviewer wrote which
were the first reviews (F22)
• ratio of the number of cases in which he/she was the only
reviewer (F23)
• average rating given by reviewer (F24)
• standard deviation in rating (F25)
• the reviewer always gave only good, average or bad rating
(F26)
• a reviewer gave both good and bad ratings (F27)
• a reviewer gave good rating and average rating (F28)
• a reviewer gave bad rating and average rating (F29)
• a reviewer gave all three ratings (F30)
• percent of times that a reviewer wrote a review with binary
features F20 (F31) and F21 (F32).
14
Product Centric Features
• Price of the product (F33)
• Sales rank of the product (F34)
• Average rating (F35)
• standard deviation in ratings (F36)
15
Results of Type 2 and Type 3 Spam
Detection
• logistic regression on the data using 470 spam reviews for
positive class and rest of the reviews for negative class.
• 10-fold cross validation.
16
Analysis of Type 1 Spam Reviews
• 1. To promote some target objects (hype spam).
• 2. To damage the reputation of some other target
objects (defaming spam).
17
Making Use of Duplicates
• the same person writes the same review for different versions
of the same product may not be spam.
• We propose to treat all duplicate spam reviews as positive
examples, and the rest of the reviews as negative examples.
• We then use them to learn a model to identify non-duplicate
reviews with similar characteristics, which are likely to be spam
reviews.
18
Model Building Using
Duplicates
• building the logistic regression model using duplicates and
non-duplicates is not for detecting duplicate spam
• Our real purpose is to use the model to identify type 1 spam
reviews that are not duplicated.
• check whether it can predict outlier reviews
19
• Outlier reviews : whose ratings deviate from the average
product rating a great deal.
• Sentiment classification techniques may be used to
automatically assign a rating to a review solely based on its
review content.
20
Predicting Outlier Reviews
• Negative deviation is considered as less than -1 from the
mean rating and positive deviation as larger than +1 from the
mean rating of the product.
• Spammers may not want their review ratings to deviate too
much from the norm to make the reviews too suspicious.
21
22
X% of reviews
(the test data)
cumulated
percentage of
reviews of the
current bin
Some Other Interesting Reviews
- Only Reviews
• We did not use any position features of reviews (F6, F7, F8, F9, F20 and
F21) and number of reviews of product (F12, F18, F35, and F36) related
features in model building to prevent overfitting.
• only reviews are very likely to be candidates of spam
23
- Reviews from Top-Ranked Reviewers
• Top-ranked reviewers generally write a large number of reviews, much
more than bottom ranked reviewers.
• Top ranked reviewers also score high on some important indicators of
spam reviews.
 top ranked reviewers are less trustworthy as compared to bottom ranked
reviewers.
24
- Reviews with Different Levels of Feedbacks
• If usefulness of a review is defined based on the feedbacks
that the review gets, it means that people can be readily
fooled by a spam review.
feedback spam
25
- Reviews of Products with Varied Sales
Ranks
• Spam activities are more limited to low selling products. 
difficult to damage reputation of a high selling or popular
product by writing a spam review.
26
Conclusions & Future work
• Results showed that the logistic regression model is highly
effective.
• It is very hard to manually label training examples for type 1
spam.
• We will further improve the detection methods, and also look
into spam in other kinds of media, e.g., forums and blogs.
27

More Related Content

What's hot

Dbms Notes Lecture 9 : Specialization, Generalization and Aggregation
Dbms Notes Lecture 9 : Specialization, Generalization and AggregationDbms Notes Lecture 9 : Specialization, Generalization and Aggregation
Dbms Notes Lecture 9 : Specialization, Generalization and Aggregation
BIT Durg
 
Chapter07
Chapter07Chapter07
Chapter07
Muhammad Ahad
 
Web Engineering- Web Application Categories & Characteristics Book Gerti Kappel
Web Engineering- Web Application Categories & Characteristics Book Gerti KappelWeb Engineering- Web Application Categories & Characteristics Book Gerti Kappel
Web Engineering- Web Application Categories & Characteristics Book Gerti Kappel
ARVIND PANDE
 
Message and Stream Oriented Communication
Message and Stream Oriented CommunicationMessage and Stream Oriented Communication
Message and Stream Oriented Communication
Dilum Bandara
 
DATABASE CONSTRAINTS
DATABASE CONSTRAINTSDATABASE CONSTRAINTS
DATABASE CONSTRAINTS
sunanditaAnand
 
Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??
Abdul Aslam
 
Client server model
Client server modelClient server model
Client server model
Gd Goenka University
 
process control block
process control blockprocess control block
process control block
Vikas SHRIVASTAVA
 
NTFS and Inode
NTFS and InodeNTFS and Inode
NTFS and Inode
Amit Seal Ami
 
Dbms 14: Relational Calculus
Dbms 14: Relational CalculusDbms 14: Relational Calculus
Dbms 14: Relational Calculus
Amiya9439793168
 
ICMP
ICMPICMP
AI Lab Manual.docx
AI Lab Manual.docxAI Lab Manual.docx
AI Lab Manual.docx
KPRevathiAsstprofITD
 
Communications is distributed systems
Communications is distributed systemsCommunications is distributed systems
Communications is distributed systems
SHATHAN
 
The Cloud Cube
The Cloud CubeThe Cloud Cube
The Cloud Cube
Adrius42
 
Lecture11 use case sequence diagram
Lecture11 use case sequence diagramLecture11 use case sequence diagram
Lecture11 use case sequence diagram
Shahid Riaz
 
Web browser architecture.87 to 88
Web browser architecture.87 to 88Web browser architecture.87 to 88
Web browser architecture.87 to 88myrajendra
 
Code Generation
Code GenerationCode Generation
Code Generation
PrabuPappuR
 
ER model to Relational model mapping
ER model to Relational model mappingER model to Relational model mapping
ER model to Relational model mappingShubham Saini
 

What's hot (20)

Dbms Notes Lecture 9 : Specialization, Generalization and Aggregation
Dbms Notes Lecture 9 : Specialization, Generalization and AggregationDbms Notes Lecture 9 : Specialization, Generalization and Aggregation
Dbms Notes Lecture 9 : Specialization, Generalization and Aggregation
 
Chapter07
Chapter07Chapter07
Chapter07
 
Web Engineering- Web Application Categories & Characteristics Book Gerti Kappel
Web Engineering- Web Application Categories & Characteristics Book Gerti KappelWeb Engineering- Web Application Categories & Characteristics Book Gerti Kappel
Web Engineering- Web Application Categories & Characteristics Book Gerti Kappel
 
Message and Stream Oriented Communication
Message and Stream Oriented CommunicationMessage and Stream Oriented Communication
Message and Stream Oriented Communication
 
DATABASE CONSTRAINTS
DATABASE CONSTRAINTSDATABASE CONSTRAINTS
DATABASE CONSTRAINTS
 
Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??
 
Client server model
Client server modelClient server model
Client server model
 
process control block
process control blockprocess control block
process control block
 
NTFS and Inode
NTFS and InodeNTFS and Inode
NTFS and Inode
 
Dbms 14: Relational Calculus
Dbms 14: Relational CalculusDbms 14: Relational Calculus
Dbms 14: Relational Calculus
 
ICMP
ICMPICMP
ICMP
 
AI Lab Manual.docx
AI Lab Manual.docxAI Lab Manual.docx
AI Lab Manual.docx
 
Communications is distributed systems
Communications is distributed systemsCommunications is distributed systems
Communications is distributed systems
 
The Cloud Cube
The Cloud CubeThe Cloud Cube
The Cloud Cube
 
Lecture11 use case sequence diagram
Lecture11 use case sequence diagramLecture11 use case sequence diagram
Lecture11 use case sequence diagram
 
Web browser architecture.87 to 88
Web browser architecture.87 to 88Web browser architecture.87 to 88
Web browser architecture.87 to 88
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Code Generation
Code GenerationCode Generation
Code Generation
 
ER model to Relational model mapping
ER model to Relational model mappingER model to Relational model mapping
ER model to Relational model mapping
 
Client server architecture
Client server architectureClient server architecture
Client server architecture
 

Viewers also liked

Libro2 personal planillas notas
Libro2 personal planillas notasLibro2 personal planillas notas
Libro2 personal planillas notascles12
 
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
SOYEON KIM
 
Opinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network EffectsOpinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network Effects
SOYEON KIM
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
ijsrd.com
 
Common problems with email
Common problems with emailCommon problems with email
Common problems with email
Tal Aviv
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)
k.surya kumar
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
anthonytaylor01
 
Credit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A SurveyCredit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A Survey
IJMER
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentation
Hernan Huwyler
 
Fraud detection
Fraud detectionFraud detection
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detectionkalpesh1908
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & control
Dominic Sroda Korkoryi
 

Viewers also liked (13)

Libro2 personal planillas notas
Libro2 personal planillas notasLibro2 personal planillas notas
Libro2 personal planillas notas
 
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
Investigating the Effectiveness of E-mail Spam Image Data for Phone Spam Imag...
 
Opinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network EffectsOpinion Fraud Detection in Online Reviews by Network Effects
Opinion Fraud Detection in Online Reviews by Network Effects
 
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining TechniquesSurvey on Credit Card Fraud Detection Using Different Data Mining Techniques
Survey on Credit Card Fraud Detection Using Different Data Mining Techniques
 
FAKE Review Detection
FAKE Review DetectionFAKE Review Detection
FAKE Review Detection
 
Common problems with email
Common problems with emailCommon problems with email
Common problems with email
 
Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)Credit card fraud detection methods using Data-mining.pptx (2)
Credit card fraud detection methods using Data-mining.pptx (2)
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Credit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A SurveyCredit Card Fraud Detection System: A Survey
Credit Card Fraud Detection System: A Survey
 
Fraud Detection presentation
Fraud Detection presentationFraud Detection presentation
Fraud Detection presentation
 
Fraud detection
Fraud detectionFraud detection
Fraud detection
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
Presentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & controlPresentation on fraud prevention, detection & control
Presentation on fraud prevention, detection & control
 

Similar to Opinion spam and analysis

Opinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam DetectionOpinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam Detection
IRJET Journal
 
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Syeda Kauser Inamdar
 
Mahendra nath
Mahendra nathMahendra nath
Mahendra nath
MahendraDwivedi7
 
Market and competitor research
Market and competitor researchMarket and competitor research
Market and competitor research
Sohaib Thiab
 
Fraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning TechniquesFraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning Techniques
ijceronline
 
Google: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer GroupsGoogle: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer Groups
Automotive Digital Marketing Professional Community
 
How and When To Code Review
How and When To Code ReviewHow and When To Code Review
How and When To Code Review
Paul Gower
 
Customer Content Marketing
Customer Content MarketingCustomer Content Marketing
Customer Content Marketing
Aaron Weiche
 
Control Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media BreakfastControl Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media Breakfast
Aaron Weiche
 
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
International Federation for Information Technologies in Travel and Tourism (IFITT)
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support System
Kavita Ganesan
 
Evaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital EnvironmentEvaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital Environment
FlexMR
 
Code Review: How And When
Code Review: How And WhenCode Review: How And When
Code Review: How And When
Paul Gower
 
Social media metrics and ROI
Social media metrics and ROISocial media metrics and ROI
Social media metrics and ROI
InsightAsia
 
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollowThe Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
CleverTap
 
Collective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and MetadataCollective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and Metadata
Shebuti Rayana
 
1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx
jackiewalcutt
 
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom
 

Similar to Opinion spam and analysis (20)

Opinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam DetectionOpinion Mining and Opinion Spam Detection
Opinion Mining and Opinion Spam Detection
 
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
Level Up Your Python Code Reviews to Ship It Faster - WWCodeDigital - 2020
 
Mahendra nath
Mahendra nathMahendra nath
Mahendra nath
 
Market and competitor research
Market and competitor researchMarket and competitor research
Market and competitor research
 
Fraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning TechniquesFraud Detection in Online Reviews using Machine Learning Techniques
Fraud Detection in Online Reviews using Machine Learning Techniques
 
Google: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer GroupsGoogle: Spotting Fake Reviewer Groups
Google: Spotting Fake Reviewer Groups
 
Npd
NpdNpd
Npd
 
How and When To Code Review
How and When To Code ReviewHow and When To Code Review
How and When To Code Review
 
Customer Content Marketing
Customer Content MarketingCustomer Content Marketing
Customer Content Marketing
 
Control Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media BreakfastControl Your Online Reputation - MSP Social Media Breakfast
Control Your Online Reputation - MSP Social Media Breakfast
 
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...Online hotel reviews: rating symbols or text... text of rating symbols? That ...
Online hotel reviews: rating symbols or text... text of rating symbols? That ...
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support System
 
Evaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital EnvironmentEvaluating Brand Attributes in a Digital Environment
Evaluating Brand Attributes in a Digital Environment
 
Code Review: How And When
Code Review: How And WhenCode Review: How And When
Code Review: How And When
 
Social media metrics and ROI
Social media metrics and ROISocial media metrics and ROI
Social media metrics and ROI
 
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollowThe Journey from User Acquisition to Advocacy - CleverTap & AppFollow
The Journey from User Acquisition to Advocacy - CleverTap & AppFollow
 
Collective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and MetadataCollective Opinion Spam Detection Bridging Review Networks and Metadata
Collective Opinion Spam Detection Bridging Review Networks and Metadata
 
1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx1. Click here to retrieve the Risk Management Template. Working wi.docx
1. Click here to retrieve the Risk Management Template. Working wi.docx
 
GH_Final1.1
GH_Final1.1GH_Final1.1
GH_Final1.1
 
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
UserZoom Education Series - Research Deep Dive - Advanced - Task-Based TOL (b...
 

More from SOYEON KIM

Network-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal dataNetwork-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal data
SOYEON KIM
 
Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...
SOYEON KIM
 
Systems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traitsSystems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traits
SOYEON KIM
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
SOYEON KIM
 
Network embedding
Network embeddingNetwork embedding
Network embedding
SOYEON KIM
 
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
SOYEON KIM
 
Deep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a surveyDeep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a survey
SOYEON KIM
 
DeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social RepresentationsDeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social Representations
SOYEON KIM
 
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral FilteringConvolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
SOYEON KIM
 
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image SearchVisual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
SOYEON KIM
 
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
SOYEON KIM
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysis
SOYEON KIM
 
Translated learning
Translated learningTranslated learning
Translated learning
SOYEON KIM
 
Self taught clustering
Self taught clusteringSelf taught clustering
Self taught clustering
SOYEON KIM
 
Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...
SOYEON KIM
 
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
SOYEON KIM
 
Text extraction from natural scene image, a survey
Text extraction from natural scene image, a surveyText extraction from natural scene image, a survey
Text extraction from natural scene image, a survey
SOYEON KIM
 
Evaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognitionEvaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognition
SOYEON KIM
 
Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...
SOYEON KIM
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clustering
SOYEON KIM
 

More from SOYEON KIM (20)

Network-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal dataNetwork-based machine learning approach for aggregating multi-modal data
Network-based machine learning approach for aggregating multi-modal data
 
Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...Revealing disease-associated pathways by network integration of untargeted me...
Revealing disease-associated pathways by network integration of untargeted me...
 
Systems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traitsSystems genetics approaches to understand complex traits
Systems genetics approaches to understand complex traits
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
 
Network embedding
Network embeddingNetwork embedding
Network embedding
 
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
Integrative Pathway-based Survival Prediction utilizing the Interaction betwe...
 
Deep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a surveyDeep learning based multi-omics integration, a survey
Deep learning based multi-omics integration, a survey
 
DeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social RepresentationsDeepWalk: Online Learning of Social Representations
DeepWalk: Online Learning of Social Representations
 
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral FilteringConvolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
 
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image SearchVisual-Textual Joint Relevance Learning for Tag-Based Social Image Search
Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search
 
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated wi...
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysis
 
Translated learning
Translated learningTranslated learning
Translated learning
 
Self taught clustering
Self taught clusteringSelf taught clustering
Self taught clustering
 
Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...Semi-automatic ground truth generation using unsupervised clustering and limi...
Semi-automatic ground truth generation using unsupervised clustering and limi...
 
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
Mobile Phone Spam Image Detection based on Graph Partitioning with Pyramid H...
 
Text extraction from natural scene image, a survey
Text extraction from natural scene image, a surveyText extraction from natural scene image, a survey
Text extraction from natural scene image, a survey
 
Evaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognitionEvaluating color descriptors for object and scene recognition
Evaluating color descriptors for object and scene recognition
 
Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...Outcome-guided mutual information networks for investigating gene-gene intera...
Outcome-guided mutual information networks for investigating gene-gene intera...
 
Spectral clustering
Spectral clusteringSpectral clustering
Spectral clustering
 

Recently uploaded

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 

Recently uploaded (20)

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 

Opinion spam and analysis

  • 1. Opinion Spam and Analysis Nitin Jindal and Bing Liu (WSDM, 2008) 1
  • 2. three types of spam reviews • Type 1 (untruthful opinions): also known as fake reviews or bogus reviews. 2 deliberately mislead readers or opinion mining systems 1.To promote some target objects (hype spam). 2. To damage the reputation of some other target objects (defaming spam).
  • 3. three types of spam reviews • Type 2 (reviews on brands only): not comment on the products for the products but only the brands, the manufacturers or the sellers. • Type 3 (non-reviews): two main sub-types: • advertisements • other irrelevant reviews containing no opinions (questions, answers ..) • They are not targeted at the specific products and are often biased. 3
  • 4. Review Data from amazon.com • Each amazon.com’s review consists of 8 parts • <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Revie > 4 1 2 3 45 6 7 8
  • 5. 5
  • 6. Reviews, Reviewer, and Products 6 Figure 1. Log-log plot of number of reviews to number of members for amazon. Figure 2. Log-log plot of number of reviews to number of products for amazon. • There are 2 reviewers with more than 15,000 reviews. • 68% of reviewers wrote only 1 review. • 50% of products have only 1 review. • Only 19% of the products have at least 5 reviews.
  • 7. Review Ratings and Feedbacks 7 Figure 3. Log-log plot of number of reviews to number of feedbacks for amazon. Figure 4. Rating vs. percent of reviews • 60% of the reviews have a rating of 5.0. • On average, a review gets 7 feedbacks. • The percentage of positive feedbacks of a review decreases rapidly from the first review of a product to the last.
  • 8. Spam Detection • a classification problem with two classes, spam and non-spam. • manually label training examples for spam reviews of type 2 and type 3. • recognizing whether a review is an untruthful opinion spam (type 1) is extremely difficult by manually reading the review. • found a large number of duplicate and near-duplicate reviews. • 1. Duplicates from different userids on the same product. • 2. Duplicates from the same userid on different products. • 3. Duplicates from different userids on different products. 8
  • 9. Detection of Duplicate Reviews • shingle method (2-grams) • Jaccard distance (Similarity score) > 90%  duplicates. • The maximum similarity score : the maximum of similarity scores between different reviews of a reviewer. 9
  • 10. Detecting Type 2 & Type 3 • Model Building Using Logistic Regression • Manually labeled 40 spam reviews of the two types. • It produces a probability estimate of each review being a spam. 10
  • 11. Feature Identification and Construction • There are three main types of information • (1) the content of the review, • (2) the reviewer who wrote the review, • (3) the product being reviewed. • three types of features: • (1) review centric features, • (2) reviewer centric features, • (3) product centric features. • three types based on their average ratings • Good (rating ≥ 4), bad (rating ≤ 2.5) and Average, otherwise 11
  • 12. Review Centric Features • number of feedbacks(F1) • number of helpful feedbacks(F2) • percent of helpful feedbacks(F3) • length of the review title(F4) • length of review body(F5) • Position of the review of a product sorted by date. ascending (F6) and descending (F7) • the first review (F8) • the only review (F9) 12
  • 13. Review Centric Features -Textual features• percent of positive bearing words(F10) • percent of negative bearing words(F11) • cosine similarity (F12) • percent of times brand name (F13) • percent of numerals words(F14) • percent of capitals words(F15) • percent of all capital words(F16) 13 • rating of the review (F17) • the deviation from product rating (F18) • the review is good, average or bad (F19) • a bad review was written just after the first good review of the product and vice versa (F20, F21) Review Centric Features –Rating related features
  • 14. Reviewer Centric Features • Ratio of the number of reviews that the reviewer wrote which were the first reviews (F22) • ratio of the number of cases in which he/she was the only reviewer (F23) • average rating given by reviewer (F24) • standard deviation in rating (F25) • the reviewer always gave only good, average or bad rating (F26) • a reviewer gave both good and bad ratings (F27) • a reviewer gave good rating and average rating (F28) • a reviewer gave bad rating and average rating (F29) • a reviewer gave all three ratings (F30) • percent of times that a reviewer wrote a review with binary features F20 (F31) and F21 (F32). 14
  • 15. Product Centric Features • Price of the product (F33) • Sales rank of the product (F34) • Average rating (F35) • standard deviation in ratings (F36) 15
  • 16. Results of Type 2 and Type 3 Spam Detection • logistic regression on the data using 470 spam reviews for positive class and rest of the reviews for negative class. • 10-fold cross validation. 16
  • 17. Analysis of Type 1 Spam Reviews • 1. To promote some target objects (hype spam). • 2. To damage the reputation of some other target objects (defaming spam). 17
  • 18. Making Use of Duplicates • the same person writes the same review for different versions of the same product may not be spam. • We propose to treat all duplicate spam reviews as positive examples, and the rest of the reviews as negative examples. • We then use them to learn a model to identify non-duplicate reviews with similar characteristics, which are likely to be spam reviews. 18
  • 19. Model Building Using Duplicates • building the logistic regression model using duplicates and non-duplicates is not for detecting duplicate spam • Our real purpose is to use the model to identify type 1 spam reviews that are not duplicated. • check whether it can predict outlier reviews 19
  • 20. • Outlier reviews : whose ratings deviate from the average product rating a great deal. • Sentiment classification techniques may be used to automatically assign a rating to a review solely based on its review content. 20
  • 21. Predicting Outlier Reviews • Negative deviation is considered as less than -1 from the mean rating and positive deviation as larger than +1 from the mean rating of the product. • Spammers may not want their review ratings to deviate too much from the norm to make the reviews too suspicious. 21
  • 22. 22 X% of reviews (the test data) cumulated percentage of reviews of the current bin
  • 23. Some Other Interesting Reviews - Only Reviews • We did not use any position features of reviews (F6, F7, F8, F9, F20 and F21) and number of reviews of product (F12, F18, F35, and F36) related features in model building to prevent overfitting. • only reviews are very likely to be candidates of spam 23
  • 24. - Reviews from Top-Ranked Reviewers • Top-ranked reviewers generally write a large number of reviews, much more than bottom ranked reviewers. • Top ranked reviewers also score high on some important indicators of spam reviews.  top ranked reviewers are less trustworthy as compared to bottom ranked reviewers. 24
  • 25. - Reviews with Different Levels of Feedbacks • If usefulness of a review is defined based on the feedbacks that the review gets, it means that people can be readily fooled by a spam review. feedback spam 25
  • 26. - Reviews of Products with Varied Sales Ranks • Spam activities are more limited to low selling products.  difficult to damage reputation of a high selling or popular product by writing a spam review. 26
  • 27. Conclusions & Future work • Results showed that the logistic regression model is highly effective. • It is very hard to manually label training examples for type 1 spam. • We will further improve the detection methods, and also look into spam in other kinds of media, e.g., forums and blogs. 27

Editor's Notes

  1. We consider them as spam because they are not targeted at the specific products and are often biased.
  2. &amp;lt;number&amp;gt;
  3. We can use the probability to weight each review. Since the probability reflects the likelihood that a review is a spam, those reviews with high probabilities can be weighted down to reduce their effects on opinion mining.
  4. 1.3.5 write by owners or manufacturers. 2.4.6 write by competititors 1.4 are not damaging, 2.3.5.6 are very harmful. spam detection techniques should focus on identifying reviews in these regions.
  5. Since the number of such duplicate spam reviews is quite large, it is reasonable to assume that they are a fairly good and random sample of many kinds of spam.
  6. To further confirm its predictability, we want to show that it is also predictive of reviews that are more likely to be spam and are not duplicated, i.e., outlier reviews. They are more likely to be spam reviews than average reviews because high rating deviation is a necessary condition for a harmful spam review but not sufficient because some reviewers may truly have different views from others.