Prevalence of Deception in Online Customer Reviews

WWW 2012 – Session: Fraud and Bias in User Ratings. April 16–20, 2012, Lyon, France.

Estimating the Prevalence of Deception in Online Review Communities

Myle Ott (Dept. of Computer Science, Cornell University), Claire Cardie (Depts. of Computer Science and Information Science, Cornell University), Jeff Hancock (Depts. of Communication and Information Science, Cornell University), Ithaca, NY 14850. Contact: jeff.hancock@cornell.edu

ABSTRACT

Consumers' purchase decisions are increasingly influenced by user-generated online reviews [3]. Accordingly, there has been growing concern about the potential for posting deceptive opinion spam—fictitious reviews that have been deliberately written to sound authentic, to deceive the reader [15]. But while this practice has received considerable public attention and concern, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still about the factors that influence it.

We propose a generative model of deception which, in conjunction with a deception classifier [15], we use to explore the prevalence of deception in six popular online review communities: Expedia,, Orbitz, Priceline, TripAdvisor, and Yelp. We additionally propose a theoretical model of online reviews based on economic signaling theory [18], in which consumer reviews diminish the inherent information asymmetry between consumers and producers by acting as a signal to a product's true, unknown quality. We find that deceptive opinion spam is a growing problem overall, but with different growth rates across communities. These rates, we argue, are driven by the different signaling costs associated with deception in each review community, e.g., posting requirements. When measures are taken to increase signaling cost, e.g., filtering reviews written by first-time reviewers, deception prevalence is effectively reduced.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing; J.4 [Computer Applications]: Social and Behavioral Sciences—economics, psychology; K.4.1 [Computers and Society]: Public Policy Issues—abuse and crime involving computers; K.4.4 [Computers and Society]: Electronic Commerce

General Terms: Algorithms, Experimentation, Measurement, Theory

Keywords: Deceptive opinion spam, Deception prevalence, Gibbs sampling, Online reviews, Signaling theory
1. INTRODUCTION

Consumers rely increasingly on user-generated online reviews to make, or reverse, purchase decisions [3]. Accordingly, there appears to be widespread and growing concern among both businesses and the public [12, 14, 16, 19, 20, 21] regarding the potential for posting deceptive opinion spam—fictitious reviews that have been deliberately written to sound authentic, to deceive the reader [15]. Perhaps surprisingly, however, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still is known about the factors that can influence it. On the one hand, the relative ease of producing reviews, combined with the pressure for businesses, products, and services to be perceived in a positive light, might lead one to expect that a preponderance of online reviews are fake. On the other hand, one can argue that a low rate of deception is required for review sites to serve any value.

Footnote 1: It is worth pointing out that a review site containing deceptive reviews might still serve value, for example, if there remains enough truthful content to produce reasonable aggregate comparisons between offerings.

The focus of spam research in the context of online reviews has been primarily on detection. Jindal and Liu [8], for example, train models using features based on the review text, reviewer, and product to identify duplicate opinions. Yoo and Gretzel [23] gather 40 truthful and 42 deceptive hotel reviews and, using a standard statistical test, manually compare the psychologically relevant linguistic differences between them. While useful, these approaches do not focus on the prevalence of deception in online reviews. Indeed, empirical, scholarly studies of the prevalence of deceptive opinion spam have remained elusive.

Footnote 2: Duplicate (or near-duplicate) opinions are opinions that appear more than once in the corpus with the same (or similar) text. However, simply because a review is duplicated does not make it deceptive. Furthermore, it seems unlikely that either duplication or plagiarism characterizes the majority of fake reviews. Moreover, such reviews are potentially detectable via off-the-shelf plagiarism detection software.

One reason is the difficulty in obtaining reliable gold-standard annotations for reviews, i.e., trusted labels that tag each review as either truthful (real) or deceptive (fake). One option for producing gold-standard labels, for example, would be to rely on the judgements of human annotators. Recent studies, however, show that deceptive opinion spam is not easily identified by human readers [15]; this is especially the case when considering the overtrusting nature of most human judges, a phenomenon referred to in the psychological deception literature as a truth bias [22].
To help illustrate the non-trivial nature of identifying deceptive content, given below are two positive reviews of the Hilton Chicago Hotel, one of which is truthful, and the other of which is deceptive opinion spam:

1. "My husband and I stayed in the Hilton Chicago and had a very nice stay! The rooms were large and comfortable. The view of Lake Michigan from our room was gorgeous. Room service was really good and quick, eating in the room looking at that view, awesome! The pool was really nice but we didnt get a chance to use it. Great location for all of the downtown Chicago attractions such as theaters and museums. Very friendly staff and knowledgable, you cant go wrong staying here."

2. "We loved the hotel. When I see other posts about it being shabby I can't for the life of me figure out what they are talking about. Rooms were large with TWO bathrooms, lobby was fabulous, pool was large with two hot tubs and huge gym, staff was courteous. For us, the location was great–across the street from Grant Park with a great view of Buckingham Fountain and close to all the museums and theatres. I'm sure others would rather be north of the river closer to the Magnificent Mile but we enjoyed the quieter and more scenic location. Got it for $105 on Hotwire. What a bargain for such a nice hotel."

Answer: See Footnote 3.

Footnote 3: The first review is deceptive opinion spam.

The difficulty of detecting which of these reviews is fake is consistent with recent large meta-analyses demonstrating the inaccuracy of human judgments of deception, with accuracy rates typically near chance [1]. In particular, humans have a difficult time identifying deceptive messages from cues alone, and as such, it is not surprising that research on estimating the prevalence of deception (see Section 8.2) has generally relied on self-report methods, even though such reports are difficult and expensive to obtain, especially in large-scale settings, e.g., the web [5]. More importantly, self-report methods, such as diaries and large-scale surveys, have several methodological concerns, including social desirability bias and self-deception [4]. Furthermore, there are considerable disincentives to revealing one's own deception in the case of online reviews, such as being permanently banned from a review portal, or harming a business's reputation.

Recently, automated approaches (see Section 4.1) have emerged to reliably label reviews as truthful vs. deceptive: Ott et al. [15] train an n-gram-based text classifier using a corpus of truthful and deceptive reviews—the former culled from online review communities and the latter generated using Amazon Mechanical Turk ( Their resulting classifier is nearly 90% accurate.

In this work, we present a general framework (see Section 2) for estimating the prevalence of deception in online review communities. Given a classifier that distinguishes truthful from deceptive reviews (like that described above), and inspired by studies of disease prevalence [9, 10], we propose a generative model of deception (see Section 3) that jointly models the classifier's uncertainty as well as the ground-truth deceptiveness of each review. Inference for this model, which we perform via Gibbs sampling, allows us to estimate the prevalence of deception in the underlying review community, without relying on either self-reports or gold-standard annotations.

We further propose a theoretical component to the framework based on signaling theory from economics [18] (see Section 6) and use it to reason about the factors that influence deception prevalence in online review communities. In our context, signaling theory interprets each review as a signal to the product's true, unknown quality; thus, the goal of consumer reviews is to diminish the inherent information asymmetry between consumers and producers. Very briefly, according to a signaling theory approach, deception prevalence should be a function of the costs and benefits that accrue from producing a fake review. We hypothesize that review communities with low signaling cost, such as communities that make it easy to post a review, and large benefits, such as highly trafficked sites, will exhibit more deceptive opinion spam than those with higher signaling costs, such as communities that establish additional requirements for posting reviews, and lower benefits, such as low site traffic.

We apply our approach to the domain of hotel reviews. In particular, we examine hotels from the Chicago area, restricting attention to positive reviews only, and instantiate the framework for six online review communities (see Section 5): Expedia (, (, Orbitz (, Priceline (, TripAdvisor (, and Yelp (

We find first that the prevalence of deception indeed varies by community. However, because it is not possible to validate these estimates empirically (i.e., the gold-standard rate of deception in each community is unknown), we focus our discussion instead on the relative differences in the rate of deception between communities. Here, the results confirm our hypotheses and suggest that deception is most prevalent in communities with a low signal cost. Importantly, when measures are taken to increase a community's signal cost, we find dramatic reductions in our estimates of the rate of deception in that community.
2. FRAMEWORK

In this section, we propose a framework to estimate the prevalence, or rate, of deception among reviews in six online review communities. Since reviews in these communities do not have gold-standard annotations of deceptiveness, and neither human judgements nor self-reports of deception are reliable in this setting (see discussion in Section 1), our framework instead estimates the rates of deception in these communities using the output of an imperfect, automated deception classifier. In particular, we utilize a supervised machine learning classifier, which has been shown recently by Ott et al. [15] to be nearly 90% accurate at detecting deceptive opinion spam in a class-balanced dataset.

A similar framework has been used previously in studies of disease prevalence, in which gold-standard diagnostic testing is either too expensive or impossible to perform [9, 10]. In such cases, it is therefore necessary to estimate the prevalence of disease in the population using a combination of an imperfect diagnostic test and estimates of the test's positive and negative recall rates.

Footnote 4: Recall rates of an imperfect diagnostic test are unlikely to be known precisely. However, imprecise estimates can often be obtained, especially in cases where it is feasible to perform gold-standard testing on a small subpopulation.
Our proposed framework is summarized here, with each step discussed in greater detail in the corresponding section:

1. Data (Section 5): Assume given a set of labeled training reviews, D^train = {(x_i, y_i) : i = 1, ..., N^train}, where, for each review i, y_i ∈ {0, 1} gives the review's label (0 for truthful, 1 for deceptive), and x_i ∈ R^|V| gives the review's feature vector representation, for some feature space of size |V|. Similarly, assume given a set of labeled truthful development reviews, D^dev = {(x_i, 0) : i = 1, ..., N^dev}, and a set of unlabeled test reviews, D^test = {x_i : i = 1, ..., N^test}.

2. Deception Classifier (Section 4.1): Using the labeled training reviews, D^train, learn a supervised deception classifier, f : R^|V| → {0, 1}.

3. Classifier Sensitivity and Specificity (Section 4.2): By cross-validation on D^train, estimate the sensitivity (deceptive recall) of the deception classifier, f, as:

   η = Pr(f(x_i) = 1 | y_i = 1).   (1)

   Then, use D^dev to estimate the specificity (truthful recall) of the deception classifier, f, as:

   θ = Pr(f(x_i) = 0 | y_i = 0).   (2)

4. Prevalence Models (Section 3): Finally, use f, η, θ, and either the Naïve Prevalence Model (Section 3.1) or the generative Bayesian Prevalence Model (Section 3.2) to estimate the prevalence of deception, denoted π, among reviews in D^test. Note that if we had gold-standard labels {y_i} for the test reviews, the gold-standard prevalence of deception would be:

   π* = (1/N^test) Σ_{i=1}^{N^test} y_i.   (3)

[Figure 1: The Bayesian Prevalence Model in plate notation, with nodes α, π*, β, γ, y_i, η*, θ*, and f(x_i). Shaded nodes represent observed variables, and arrows denote dependence. For example, f(x_i) is observed, and depends on η*, θ*, and y_i.]
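The estimates in Equations 1–3 are simple counting ratios. The following is a minimal sketch (not from the paper) of how they could be computed from arrays of gold labels and classifier predictions, assuming NumPy arrays with 1 = deceptive and 0 = truthful; the example labels are made up.

```python
import numpy as np

def sensitivity(y_true, y_pred):
    """Eq. (1): deceptive recall, Pr(f(x) = 1 | y = 1)."""
    deceptive = (y_true == 1)
    return float(np.mean(y_pred[deceptive] == 1))

def specificity(y_true, y_pred):
    """Eq. (2): truthful recall, Pr(f(x) = 0 | y = 0)."""
    truthful = (y_true == 0)
    return float(np.mean(y_pred[truthful] == 0))

def gold_prevalence(y_true):
    """Eq. (3): fraction of reviews whose gold label is deceptive."""
    return float(np.mean(y_true == 1))

# Toy illustration with made-up labels and predictions.
y = np.array([1, 1, 0, 0, 0])
f = np.array([1, 0, 0, 0, 1])
print(sensitivity(y, f), specificity(y, f), gold_prevalence(y))  # 0.5 0.666... 0.4
```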
3. PREVALENCE MODELS

In Section 2, we propose a framework to estimate the prevalence of deception in a group of reviews using only the output of a noisy deception classifier. Central to this framework is the Prevalence Model, which models the uncertainty of the deception classifier and ultimately produces the desired prevalence estimate. In this section, we propose two competing Prevalence Models, which can be used interchangeably in our framework.

3.1 Naïve Prevalence Model

The Naïve Prevalence Model (naïve) estimates the prevalence of deception in a corpus of reviews by correcting the output of a noisy deception classifier according to the classifier's known performance characteristics.

Formally, for a given deception classifier, f, let π_f be the fraction of reviews in D^test for which f makes a positive prediction, i.e., the fraction of reviews for which f predicts deceptive. Also, let the sensitivity (deceptive recall) and specificity (truthful recall) of f be given by η and θ, respectively. Then, we can write the expectation of π_f as:

   E[π_f] = E[(1/N^test) Σ_{x ∈ D^test} δ[f(x) = 1]]
          = (1/N^test) Σ_{x ∈ D^test} E[δ[f(x) = 1]]
          = η·π* + (1 − θ)·(1 − π*),   (4)

where π* is the true (latent) rate of deception, and δ[a = b] is the Kronecker delta function, which is equal to 1 when a = b, and 0 otherwise.

If we rearrange Equation 4 in terms of π*, and replace the expectation of π_f with the observed value, we get the Naïve Prevalence Model estimator:

   π_naïve = (π_f − (1 − θ)) / (η − (1 − θ)).   (5)

Intuitively, Equation 5 corrects the raw classifier output, given by π_f, by subtracting from it the false positive rate, given by 1 − θ, and dividing the result by the difference between the true and false positive rates, given by η − (1 − θ). Notice that when f is an oracle, i.e., when η = θ = 1, the Naïve Prevalence Model estimate correctly reduces to the oracle rate given by f, i.e., π_naïve = π_f = π*.

Footnote 5: An oracle is a classifier that does not make mistakes, and always predicts the true, gold-standard label.
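As a concrete illustration of Equation 5 (a sketch following the paper's definitions, not the authors' code), the correction can be written as a one-line function; the numbers in the example are hypothetical.

```python
def naive_prevalence(pi_f, eta, theta):
    """Eq. (5): correct the observed positive-prediction rate pi_f using the
    classifier's estimated sensitivity (eta) and specificity (theta).
    The result is not guaranteed to lie in [0, 1]; it goes negative
    whenever pi_f < 1 - theta (see Section 3.2)."""
    return (pi_f - (1.0 - theta)) / (eta - (1.0 - theta))

# Hypothetical numbers: if the classifier flags 12% of test reviews as
# deceptive, with eta = 0.90 and theta = 0.89 (false positive rate 0.11),
# the corrected estimate is (0.12 - 0.11) / (0.90 - 0.11) ≈ 0.013.
print(naive_prevalence(0.12, 0.90, 0.89))
```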
3.2 Bayesian Prevalence Model

Unfortunately, the Naïve Prevalence Model estimate, π_naïve, is not restricted to the range [0, 1]. Specifically, it is negative when π_f < 1 − θ, and greater than 1 when π_f > η. Furthermore, the Naïve Prevalence Model makes the unrealistic assumption that the estimates of the classifier's sensitivity (η) and specificity (θ), obtained using the procedure discussed in Section 4.2 and Appendix B, are known precisely.

The Bayesian Prevalence Model (bayes) addresses these limitations by modeling the generative process through which deception occurs, or, equivalently, the joint probability distribution of the observed and latent data. In particular, bayes models the observed classifier output, the true (latent) rate of deception (π*), as well as the classifier's true (latent) sensitivity (η*) and specificity (θ*). Formally, bayes assumes that our data was generated according to the following generative story:

• Sample the true rate of deception: π* ~ Beta(α)
• Sample the classifier's true sensitivity: η* ~ Beta(β)
• Sample the classifier's true specificity: θ* ~ Beta(γ)
• For each review i:
  – Sample the ground-truth deception label: y_i ~ Bernoulli(π*)
  – Sample the classifier's output: f(x_i) ~ Bernoulli(η*) if y_i = 1, and f(x_i) ~ Bernoulli(1 − θ*) if y_i = 0

The corresponding graphical model is given in plate notation in Figure 1. Notice that by placing Beta prior distributions on π*, η*, and θ*, bayes enables us to encode our prior knowledge about the true rate of deception, as well as our uncertainty about the estimates of the classifier's sensitivity and specificity. This is discussed further in Section 4.2.

A similar model has been proposed by Joseph et al. [10] for studies of disease prevalence, in which it is necessary to estimate the prevalence of disease in a population given only an imperfect diagnostic test. However, that model samples the total number of true positives and false negatives, while our model samples the y_i individually. Accordingly, while pilot experiments confirm that the two models produce identical results, the generative story of our model, given above, is comparatively much more intuitive.
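To make the generative story concrete, the short simulation below (an illustrative sketch, not part of the paper) draws synthetic classifier outputs from the model; because the latent rate π* is known by construction, such synthetic data is one way to sanity-check an inference procedure. The Beta parameters are ordinary (shape1, shape2) pairs, a convention chosen here only for simplicity.

```python
import numpy as np

def simulate(n_reviews, alpha=(1.0, 1.0), beta=(9.0, 1.0), gamma=(9.0, 1.0), seed=0):
    """Draw synthetic data following the generative story of Section 3.2."""
    rng = np.random.default_rng(seed)
    pi_star = rng.beta(*alpha)     # true rate of deception
    eta_star = rng.beta(*beta)     # classifier's true sensitivity
    theta_star = rng.beta(*gamma)  # classifier's true specificity
    y = rng.binomial(1, pi_star, size=n_reviews)           # latent labels y_i
    p_flag = np.where(y == 1, eta_star, 1.0 - theta_star)  # Pr(f(x_i) = 1)
    f_x = rng.binomial(1, p_flag)                           # observed classifier output
    return pi_star, y, f_x

pi_star, y, f_x = simulate(1000)
print(pi_star, y.mean(), f_x.mean())
```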
3.2.1 Inference

While exact inference is intractable for the Bayesian Prevalence Model, a popular alternative way of approximating the desired posterior distribution is with Markov Chain Monte Carlo (MCMC) sampling, and more specifically Gibbs sampling. Gibbs sampling works by sampling each variable, in turn, from the conditional distribution of that variable given all other variables in the model. After repeating this procedure for a fixed number of iterations, the desired posterior distribution can be approximated from samples in the chain by: (1) discarding a number of initial burn-in iterations, and (2) since adjacent samples in the chain are often highly correlated, thinning the number of remaining samples according to a sampling lag.

The conditional distributions of each variable given the others can be derived from the joint distribution, which can be read directly from the graph. Based on the graphical representation of bayes, given in Figure 1, the joint distribution of the observed and latent variables is just:

   Pr(f(x), y, π*, η*, θ*; α, β, γ) = Pr(f(x) | y, η*, θ*) · Pr(y | π*) · Pr(π* | α) · Pr(η* | β) · Pr(θ* | γ),   (6)

where each term is given according to the sampling distributions specified in the generative story in Section 3.2.

A common technique to simplify the joint distribution, and the sampling process, is to integrate out (collapse) variables that do not need to be sampled. If we integrate out π*, η*, and θ* from Equation 6, we can derive a Gibbs sampler that only needs to sample the y_i's at each iteration. The resulting sampling equations, and the corresponding Bayesian Prevalence Model estimate of the prevalence of deception, π_bayes, are given in greater detail in Appendix A.

4. DECEPTION DETECTION

4.1 Deception Classifier

The next component of the framework given in Section 2 is the deception classifier, which predicts whether each unlabeled review is truthful (real) or deceptive (fake). Following previous work [15], we assume given some amount of labeled training reviews, so that we can train deception classifiers using a supervised learning approach.

Previous work has shown that Support Vector Machines (SVM) trained on n-gram features perform well in deception detection tasks [8, 13, 15]. Following Ott et al. [15], we train linear SVM classifiers using the LIBSVM [2] software package, and represent reviews using unigram and bigram bag-of-words features. While more sophisticated and purpose-built classifiers might achieve better performance, pilot experiments suggest that the Prevalence Models (see Section 3) are not heavily affected by minor differences in classifier performance. Furthermore, the simple approach just outlined has been previously evaluated to be nearly 90% accurate at detecting deception in a balanced dataset [15]. Reference cross-validated classifier performance appears in Table 1.

Table 1: Reference 5-fold cross-validated performance of an SVM deception detection classifier in a balanced dataset of TripAdvisor reviews, given by Ott et al. [15]. F-score corresponds to the harmonic mean of precision and recall.

   metric                performance
   Accuracy              89.6%
   Deceptive Precision   89.1%
   Deceptive Recall      90.3%
   Deceptive F-score     89.7%
   Truthful Precision    90.1%
   Truthful Recall       89.0%
   Truthful F-score      89.6%
   Baseline Accuracy     50%
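The classifier of Section 4.1 is a linear SVM over unigram and bigram counts, trained with LIBSVM in the paper. The sketch below reproduces the same idea with scikit-learn purely as an illustration; the toy reviews and the value of C are placeholders (Section 7 tunes C by nested cross-validation).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the 800 labeled training reviews of Section 5
# (1 = deceptive, 0 = truthful).
train_texts = [
    "We loved the hotel, fabulous lobby and a great view of the fountain.",
    "Convenient location but the room was smaller than expected.",
]
train_labels = [1, 0]

classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),  # unigram + bigram counts
    LinearSVC(C=1.0),                                      # linear SVM
), train_labels)
print(classifier.predict(["The rooms were large and the staff was friendly."]))
```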
4.2 Classifier Sensitivity and Specificity

Both Prevalence Models introduced in Section 3 can utilize knowledge of the underlying deception classifier's sensitivity (η*), i.e., deceptive recall rate, and specificity (θ*), i.e., truthful recall rate. While it is not possible to obtain gold-standard values for these parameters, we can obtain rough estimates of their values (denoted η and θ, respectively) through a combination of cross-validation and evaluation on a labeled development set. For the Naïve Prevalence Model, the estimates are used directly, and are assumed to be exact. For the Bayesian Prevalence Model, we adopt an empirical Bayesian approach and use the estimates to inform the corresponding Beta priors via their hyperparameters, β and γ, respectively. The full procedure is given in Appendix B.
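A rough sketch of this estimation procedure follows (hypothetical helper code; the full per-hotel cross-validation is spelled out in Appendix B): η is pooled over held-out hotels, and θ is measured on the development set, which is assumed truthful.

```python
import numpy as np

def estimate_sensitivity(texts, labels, hotel_ids, fit_fn):
    """Leave-one-hotel-out estimate of eta (deceptive recall).
    `fit_fn(texts, labels)` is a hypothetical helper that returns a fitted
    classifier with a .predict() method (e.g. the pipeline sketched earlier)."""
    labels = np.asarray(labels)
    hotel_ids = np.asarray(hotel_ids)
    tp = fn = 0
    for hotel in np.unique(hotel_ids):
        held_out = (hotel_ids == hotel)
        clf = fit_fn([t for t, h in zip(texts, held_out) if not h], labels[~held_out])
        preds = np.asarray(clf.predict([t for t, h in zip(texts, held_out) if h]))
        gold = labels[held_out]
        tp += int(np.sum((gold == 1) & (preds == 1)))
        fn += int(np.sum((gold == 1) & (preds == 0)))
    return tp / (tp + fn)

def estimate_specificity(dev_texts, clf):
    """Fraction of development reviews (assumed truthful) labeled truthful."""
    preds = np.asarray(clf.predict(dev_texts))
    return float(np.mean(preds == 0))
```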
5. DATA

In this section, we briefly discuss each of the three kinds of data used by our framework introduced in Section 2. Corpus statistics are given in Table 2. Following Ott et al. [15], we excluded all reviews with fewer than 150 characters, as well as all non-English reviews.

Footnote 6: Language was identified by the Language Detection Library:

Table 2: Corpus statistics for unlabeled test reviews from six online review communities.

   community         # hotels   # reviews
   Expedia           100        4,341        103        6,792
   Orbitz            97         1,777
   Priceline         98         4,027
   TripAdvisor       104        9,602
   Yelp              103        1,537
   Mechanical Turk   20         400

5.1 Training Reviews (D^train)

Training a supervised deception classifier requires labeled training data. Following Ott et al. [15], we build a balanced set of 800 training reviews, containing 400 truthful reviews from six online review communities, and 400 gold-standard deceptive reviews from Amazon Mechanical Turk.

Deceptive Reviews: In Section 1, we discuss some of the difficulties associated with obtaining gold-standard labels of deception, including the inaccuracy of human judgements and the problems with self-reports of deception. To avoid these difficulties, Ott et al. [15] have recently created 400 gold-standard deceptive reviews using Amazon's Mechanical Turk service. In particular, they paid one US dollar ($1) to each of 400 unique Mechanical Turk workers to write a fake positive (5-star) review for one of the 20 most heavily-reviewed Chicago hotels on TripAdvisor. Each worker was given a link to the hotel's website, and instructed to write a convincing review from the perspective of a satisfied customer. Any submission found to be plagiarized was rejected. Any submission with fewer than 150 characters was discarded. To date, this is the only publicly-available gold-standard deceptive opinion spam dataset. As such, we choose it to be our sole source of labeled deceptive reviews for training our supervised deception classifiers. Note that these same reviews are used to estimate the resulting classifier sensitivity (deceptive recall), via the cross-validation procedure given in Appendix B.

Footnote 7:

Truthful Reviews: Many of the same challenges that make it difficult to obtain gold-standard deceptive reviews also apply to obtaining truthful reviews. Related work [8, 11] has hypothesized that the relative impact of spam reviews is smaller for heavily-reviewed products, and that therefore spam should be less common among them. For consistency with our labeled deceptive review data, we simply label as truthful all positive (5-star) reviews of the 20 previously chosen Chicago hotels. We then draw a random sample of size 400, and take that to be our labeled truthful training data.

5.2 Development Reviews (D^dev)

By training on deceptive and truthful reviews from the same 20 hotels, we are effectively controlling our classifier for topic. However, because this training data is not representative of Chicago hotel reviews in general, it is important that we do not use it to estimate the resulting classifier's specificity (truthful recall). Accordingly, as specified in our framework (Section 2), classifier specificity is instead estimated on a separate, labeled truthful development set, which we draw uniformly at random from the unlabeled reviews in each review community. For consistency with the sensitivity estimate, the size of the draw is always 400 reviews.

5.3 Test Reviews (D^test)

The last data component of our framework is the set of test reviews, among which to estimate the prevalence of deception. To avoid evaluating reviews that are too different from our training data in either sentiment (due to negative reviews) or topic (due to reviews of hotels outside Chicago), we constrain each community's test set to contain only positive (5-star) Chicago hotel reviews. This unfortunately disqualifies our estimates of each community's prevalence of deception from being representative of all hotel reviews. Notably, estimates of the prevalence of deception among negative reviews might be very different from our estimates, due to the distinct motives of posting deceptive positive vs. negative reviews. We discuss this further in Section 9.
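The exclusion criteria at the start of this section (at least 150 characters, English only) amount to a simple filter. A minimal sketch follows; the paper used the Java Language Detection Library, so the Python `langdetect` package here is only a stand-in assumption.

```python
from langdetect import detect  # stand-in for the Java Language Detection Library

def keep_review(text):
    """Keep only English reviews with at least 150 characters (Section 5)."""
    if len(text) < 150:
        return False
    try:
        return detect(text) == "en"
    except Exception:  # detection can fail on unusual or empty text
        return False

# reviews = [r for r in raw_reviews if keep_review(r)]   # hypothetical usage
```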
[Figure 2: Graph of Naïve estimates of deception prevalence versus time (January 2009 through July 2011), for six online review communities: (a) Orbitz, (b) Priceline, (c) Expedia, (d), (e) Yelp, (f) TripAdvisor. Blue (a–d) and red (e–f) graphs correspond to high and low posting cost communities, respectively.]

6. SIGNAL THEORY

In terms of economic theory, the role of review communities is to reduce the inherent information asymmetry [18] between buyers and sellers in online marketplaces, by providing buyers with a priori knowledge of the underlying quality of the products being sold [7]. It follows that if reviews regularly failed to reduce this information asymmetry, or, worse, conveyed false information, then they would cease to be of value to the user. Given that review communities are, in fact, valued by users [3], it seems unlikely that the prevalence of deception among them is large.

Nonetheless, there is widespread concern about the prevalence of deception in online reviews, rightly or wrongly, and further, deceptive reviews can be cause for concern even in small quantities, e.g., if they are concentrated in a single review community. We propose that by framing reviews as signals—voluntary communications that serve to convey information about the signaler [18]—we can reason about the factors underlying deception by manipulating the distinct signal costs associated with truthful vs. deceptive reviews.

Specifically, we claim that for a positive review to be posted in a given review community, there must be an incurred signal cost, that is increased by:

1. The posting cost for posting the review in a given review community, i.e., whether users are required to purchase a product prior to reviewing it (high cost) or not (low cost). Some sites, for example, allow anyone to post reviews about any hotel, making the review cost effectively zero. Other sites, however, require the purchase of the hotel room before a review can be written, raising the cost from zero to the price of the room.

and decreased by:
2. The exposure benefit of posting the review in that review community, i.e., the benefit derived from other users reading the review, which is proportional to the size of the review community's audience. Review sites with more traffic have greater exposure benefit.

Observe that both the posting cost and the exposure benefit depend entirely on the review community. An overview of these factors for each of the six review communities is given in Table 3.

Table 3: Signal costs associated with six online review communities, sorted approximately from highest signal cost to lowest. Posting cost is High if users are required to purchase a product before reviewing it, and Low otherwise. Exposure benefit is Low, Medium, or High based on the number of reviews in the community (see Table 2).

   community     posting cost   exposure benefit
   Orbitz        High           Low
   Priceline     High           Medium
   Expedia       High           Medium   High           Medium
   Yelp          Low            Low
   TripAdvisor   Low            High

Based on the signal cost function just defined, we propose two hypotheses:

• Hypothesis 1: Review communities that have low signal costs (low posting requirements, high exposure), e.g., TripAdvisor and Yelp, will have more deception than communities with high signal costs, e.g., Orbitz.

• Hypothesis 2: Increasing the signal cost will decrease the prevalence of deception.

7. EXPERIMENTAL SETUP

The framework described in Section 2 is instantiated for the six review communities introduced in Section 5. In particular, we first train our SVM deception classifier following the procedure outlined in Section 4.1. An important step when training SVM classifiers is setting the cost parameter, C. We set C using a nested 5-fold cross-validation procedure, and choose the value that gives the best average balanced accuracy, defined as (1/2)(sensitivity + specificity).

We then estimate the classifier's sensitivity, specificity, and hyperparameters, using the procedure outlined in Section 4.2 and Appendix B. Based on those estimates, we then estimate the prevalence of deception among reviews in our test set using the Naïve and the Bayesian Prevalence Models. Gibbs sampling for the Bayesian Prevalence Model is performed using Equations 7 and 8 (given in Appendix A) for 70,000 iterations, with a burn-in of 20,000 iterations, and a sampling lag of 50. We use an uninformative (uniform) prior for π*, i.e., α = ⟨1, 1⟩. Multiple runs are performed to verify the stability of the results.
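The C-selection step can be illustrated with scikit-learn's cross-validated grid search scored by balanced accuracy, i.e. (sensitivity + specificity) / 2; this is only a sketch of the inner 5-fold search of the nested procedure, and the toy data and C grid are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the labeled training reviews (1 = deceptive, 0 = truthful).
texts = ["wonderful stay great view", "room was dirty and noisy",
         "fabulous lobby friendly staff", "awful service never again"] * 5
labels = [1, 0, 1, 0] * 5

pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])

search = GridSearchCV(
    pipeline,
    param_grid={"svm__C": [0.01, 0.1, 1, 10, 100]},
    scoring="balanced_accuracy",          # (sensitivity + specificity) / 2
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
), labels)
print("selected C:", search.best_params_["svm__C"])
```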
[Figure 3: Graph of Bayesian estimates of deception prevalence versus time (January 2009 through July 2011), for six online review communities: (a) Orbitz, (b) Priceline, (c) Expedia, (d), (e) Yelp, (f) TripAdvisor. Blue (a–d) and red (e–f) graphs correspond to high and low posting cost communities, respectively. Error bars show Bayesian 95% credible intervals.]

8. RESULTS AND DISCUSSION

Estimates of the prevalence of deception for six review communities over time, given by the Naïve Prevalence Model, appear in Figure 2. Blue graphs (a–d) correspond to communities with High posting cost (see Table 3), i.e., communities for which you are required to book a hotel room before posting a review, while red graphs (e–f) correspond to communities with Low posting cost, i.e., communities that allow any user to post reviews for any hotel.

In agreement with Hypothesis 1 (given in Section 6), it is clear from Figure 2 that deceptive opinion spam is decreasing or stationary over time for High posting cost review communities (blue graphs, a–d). In contrast, review communities that allow any user to post reviews for any hotel, i.e., Low posting cost communities (red graphs, e–f), are seeing growth in their rate of deceptive opinion spam.

Unfortunately, as discussed in Section 3.1, we observe that the prevalence estimates produced by the Naïve Prevalence Model are often negative. This occurs when the rate at which the classifier makes positive predictions is below the classifier's estimated false positive rate, suggesting both that the estimated false positive rate of the classifier is perhaps overestimated, and that the classifier's estimated specificity (truthful recall rate, given by θ) is perhaps underestimated. We address this further in Section 8.1.
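For instance, taking the reference performance of Table 1 as rough estimates (η ≈ 0.90, θ ≈ 0.89, so a false positive rate of about 0.11), an observed positive-prediction rate of π_f = 0.08 (an illustrative value, not a measured one) gives π_naïve = (0.08 − 0.11) / (0.90 − 0.11) ≈ −0.04, i.e., a nonsensical estimate of roughly −4%.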
The Bayesian Prevalence Model, on the other hand, encodes the uncertainty in the estimated values of the classifier's sensitivity and specificity through two Beta priors, and in particular their hyperparameters, β and γ. Estimates of the prevalence of deception for the six review communities over time, given by the Bayesian Prevalence Model, appear in Figure 3. Blue (a–d) and red (e–f) graphs, as before, correspond to communities with High and Low posting costs, respectively.

In agreement with Hypothesis 1 (Section 6), we again find that Low signal cost communities, e.g., TripAdvisor, seem to contain larger quantities and accelerated growth of deceptive opinion spam when compared to High signal cost communities, e.g., Orbitz. Interestingly, communities with a blend of signal costs appear to have medium rates of deception that are neither growing nor declining, e.g.,, which has a rate of deception of ≈ 5%.

To test Hypothesis 2, i.e., that increasing the signal cost will decrease the prevalence of deception, we need to increase the signal cost as we have defined it in Section 6. Thus, it is necessary to either increase the posting cost, or decrease the exposure benefit. And while we have no control over a community's exposure benefit, we can increase the posting cost by, for example, hiding all reviews written by users who have not posted at least two reviews. Essentially, by requiring users to post more than one review in order for their review to be displayed, we are increasing the posting cost and, accordingly, the signal cost as well.

Bayesian Prevalence Model estimates for TripAdvisor for varying signal costs appear in Figure 4. In particular, we give the estimated prevalence of deception over time after removing reviews written by first-time review writers, and after removing reviews written by first- or second-time review writers. In agreement with Hypothesis 2, we see a clear reduction in the prevalence of deception over time on TripAdvisor after removing these reviews, with rates dropping from ≈ 6%, to ≈ 5%, and finally to ≈ 4%, suggesting that an increased signal cost may indeed help to reduce the prevalence of deception in online review communities.
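The manipulation just described is essentially a filter on reviewer history. A minimal sketch follows, assuming a hypothetical record format with a 'reviewer_id' field (not the paper's data schema).

```python
from collections import Counter

def filter_by_min_reviews(reviews, min_count=2):
    """Keep only reviews whose author appears at least `min_count` times,
    mirroring the increased posting cost discussed in Section 8."""
    counts = Counter(r["reviewer_id"] for r in reviews)
    return [r for r in reviews if counts[r["reviewer_id"]] >= min_count]

# Hypothetical usage: drop first-time reviewers, then re-estimate prevalence.
# filtered = filter_by_min_reviews(tripadvisor_reviews, min_count=2)
```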
[Figure 4: Graph of Bayesian estimates of deception prevalence versus time, for TripAdvisor, with reviews written by new users excluded: (a) all reviews, (b) first-time reviewers excluded, (c) first-time and second-time reviewers excluded. Excluding reviews written by first- or second-time reviewers increases the signal cost, and decreases the prevalence of deception.]

8.1 Assumptions and Limitations

In this work we have made a number of assumptions, a few of which we will now highlight and discuss.

First, we note that our unlabeled test set, D^test, overlaps with our labeled truthful training set, D^train. Consequently, we will underestimate the prevalence of deception, because the overlapping reviews will be more likely to be classified at test time as truthful, having been seen in training as being truthful. Excluding these overlapping reviews from the test set results in overestimating the prevalence of deception, based on the hypothesis that the overlapping reviews, chosen from the 20 most highly-reviewed Chicago hotels, are more likely to be truthful to begin with.

Second, we observe that our development set, D^dev, containing labeled truthful reviews, is not gold-standard. Unfortunately, while it is necessary to obtain a uniform sample of reviews in order to fairly estimate the classifier's truthful recall rate (specificity), such review samples are inherently unlabeled. This can be problematic if the underlying rate of deception is high among the reviews from which the development set is sampled, because the specificity will then be underestimated. Indeed, our Naïve Prevalence Model regularly produces negative estimates, suggesting that the estimated classifier specificity may indeed be underestimated, possibly due to deceptive reviews in the development set.
Third, our proposal for increasing the signal cost, by hiding reviews written by first- or second-time reviewers, is not ideal. While our results confirm that hiding these reviews will cause an immediate reduction in deception prevalence, the increase in signal cost might be insufficient to discourage new deception, once deceivers become aware of the increased posting requirements.

Fourth, in this work we have only considered a limited version of the deception prevalence problem. In particular, we have only considered positive Chicago hotel reviews, and our classifier is trained on deceptive reviews coming only from Amazon Mechanical Turk. Both negative reviews as well as deceptive reviews obtained by other means are likely to be different in character than the data used in this study.

8.2 Implications for Psychological Research

The current research also represents a novel approach to a long-standing and ongoing debate around deception prevalence in the psychological literature. In one of the first large-scale studies looking at how often people lie in everyday communication, DePaulo et al. [4] used a diary method to calculate the average number of lies told per day. At the end of seven days participants told approximately one to two lies per day, with more recent studies replicating this general finding [6], suggesting that deception is frequent in human communication. More recently, Serota et al. [17] conducted a large scale representative survey of Americans asking participants how often they lied in the last 24 hours. While they found the same average deception rate as previous research (approximately 1.65 lies per day), they discovered that the data was heavily skewed, with 60 percent of the participants reporting no lies at all. They concluded that rather than deception prevalence being spread evenly across the population, there are instead a few prolific liars. Unfortunately, both sides of this debate have relied solely on self-report data.

The current approach offers a novel method for assessing deception prevalence that does not require self-report, but can provide insight into the prevalence of deception in human communication more generally. At the same time, the question raised by the psychological research also mirrors an important point regarding the prevalence of deception in online reviews: are a few deceptive reviews posted by many people, or are there many deceptive reviews told by only a few? That is, do some hotels have many fake reviews while others are primarily honest? Or, is there a little bit of cheating by most hotels? This kind of individualized modeling represents an important next step in this line of research.
9. CONCLUSION

In this work, we have presented a general framework for estimating the prevalence of deception in online review communities, based on the output of a noisy deception classifier. Using this framework, we have explored the prevalence of deception among positive reviews in six popular online review communities, and provided the first empirical study of the magnitude, and influencing factors, of deceptive opinion spam.

We have additionally proposed a theoretical model of online reviews as a signal to a product's true (unknown) quality, based on economic signaling theory. Specifically, we have defined the signal cost of positive online reviews as a function of the posting costs and exposure benefits of the review community in which it is posted. Based on this theory, we have further suggested two hypotheses, both of which are supported by our findings. In particular, we find first that review communities with low signal costs (low posting requirements, high exposure) have more deception than communities with comparatively higher signal costs. Second, we find that by increasing the signal cost of a review community, e.g., by excluding reviews written by first- or second-time reviewers, we can effectively reduce both the prevalence and the growth rate of deception in that community.

Future work might explore other methods for manipulating the signal costs associated with posting online reviews, and the corresponding effects on deception prevalence. For example, some sites, such as Angie's List (, charge a monthly access fee in order to browse or post reviews, and future work might study the effectiveness of such techniques at deterring deception.

10. ACKNOWLEDGMENTS

This work was supported in part by National Science Foundation Grant NSCC-0904913, and the Jack Kent Cooke Foundation. We also thank, alphabetically, Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bin Lu, Karthik Raman, Lu Wang, and Ainur Yessenalina, as well as members of the Cornell NLP seminar group and the WWW reviewers for their insightful comments, suggestions and advice on various aspects of this work.

11. REFERENCES

[1] C. Bond and B. DePaulo. Accuracy of deception judgments. Personality and Social Psychology Review, 10(3):214, 2006.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at
[3] Cone. 2011 Online Influence Trend Tracker. Online:, August 2011.
[4] B. DePaulo, D. Kashy, S. Kirkendol, M. Wyer, and J. Epstein. Lying in everyday life. Journal of Personality and Social Psychology, 70(5):979, 1996.
[5] J. Hancock. Digital deception: The practice of lying in the digital age. In Deception: Methods, Contexts and Consequences, pages 109–120, 2009.
[6] J. Hancock, J. Thom-Santelli, and T. Ritchie. Deception and design: The impact of communication technology on lying behavior. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 129–134. ACM, 2004.
[7] N. Hu, L. Liu, and J. Zhang. Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology and Management, 9(3):201–214, 2008.
[8] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219–230. ACM, 2008.
[9] W. Johnson, J. Gastwirth, and L. Pearson. Screening without a "gold standard": The Hui-Walter paradigm revisited. American Journal of Epidemiology, 153(9):921, 2001.
[10] L. Joseph, T. Gyorkos, and L. Coupal. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology, 141(3):263, 1995.
[11] E. Lim, V. Nguyen, N. Jindal, B. Liu, and H. Lauw. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 939–948. ACM, 2010.
[12] D. Meyer. Fake reviews prompt Belkin apology., Jan. 2009.
[13] R. Mihalcea and C. Strapparava. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309–312. Association for Computational Linguistics, 2009.
[14] C. Miller. Company settles case of reviews it faked., July 2009.
[15] M. Ott, Y. Choi, C. Cardie, and J. Hancock. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1, pages 309–319. Association for Computational Linguistics, 2011.
[16] B. Page. Amazon withdraws ebook explaining how to manipulate its sales rankings., Jan. 2011.
[17] K. Serota, T. Levine, and F. Boster. The prevalence of lying in America: Three studies of self-reported lies. Human Communication Research, 36(1):2–25, 2010.
[18] M. Spence. Job market signaling. The Quarterly Journal of Economics, 87(3):355, 1973.
[19] D. Streitfeld. In a race to out-rave, 5-star web reviews go for $5., Aug. 2011.
[20] D. Streitfeld. For $2 a star, an online retailer gets 5-star product reviews., Jan. 2012.
[21] A. Topping. Historian Orlando Figes agrees to pay damages for fake reviews., July 2010.
[22] A. Vrij. Detecting Lies and Deceit: Pitfalls and Opportunities. Wiley-Interscience, 2008.
[23] K. Yoo and U. Gretzel. Comparison of deceptive and truthful travel reviews. In Information and Communication Technologies in Tourism 2009, pages 37–47, 2009.
APPENDIX

A. GIBBS SAMPLER FOR BAYESIAN PREVALENCE MODEL

Gibbs sampling of the Bayesian Prevalence Model, introduced in Section 3.2, is performed according to the following conditional distributions:

   Pr(y_i = 1 | f(x), y_(−i); α, β, γ) ∝ (α_1 + N_1^(−i)) · (β_{f(x_i)} + X_{f(x_i)}^(−i)) / (β + N_1^(−i)),   (7)

and,

   Pr(y_i = 0 | f(x), y_(−i); α, β, γ) ∝ (α_0 + N_0^(−i)) · (γ_{1−f(x_i)} + Y_{f(x_i)}^(−i)) / (γ + N_0^(−i)),   (8)

where,

   X_k^(−i) = Σ_{j≠i} σ[y_j = 1] · σ[f(x_j) = k],
   Y_k^(−i) = Σ_{j≠i} σ[y_j = 0] · σ[f(x_j) = k],
   N_1^(−i) = X_0^(−i) + X_1^(−i),
   N_0^(−i) = Y_0^(−i) + Y_1^(−i).

After sampling, we reconstruct the collapsed variables to yield the Bayesian Prevalence Model estimate of the prevalence of deception:

   π_bayes = (α_1 + N_1) / (α + N^test).   (9)

Estimates of the classifier's sensitivity and specificity are similarly given by:

   η_bayes = (β_1 + X_1) / (β + N_1),   (10)
   θ_bayes = (γ_1 + Y_0) / (γ + N_0).   (11)

B. ESTIMATING CLASSIFIER SENSITIVITY AND SPECIFICITY

We estimate the sensitivity and specificity of our deception classifier via the following procedure:

1. Assume given a labeled training set, D^train, containing N^train reviews of n hotels. Also assume given a development set, D^dev, containing labeled truthful reviews.

2. Split D^train into n folds, D_1^train, ..., D_n^train, of sizes given by N_1^train, ..., N_n^train, respectively, such that D_j^train contains all (and only) reviews of hotel j. Let D_(−j)^train contain all reviews except those of hotel j.

3. Then, for each hotel j:

   (a) Train a classifier, f_j, from reviews in D_(−j)^train, and use it to classify reviews in D_j^train.

   (b) Let |TP|_j correspond to the observed number of true positives, i.e.:

       |TP|_j = Σ_{(x,y) ∈ D_j^train} σ[y = 1] · σ[f_j(x) = 1].   (12)

   (c) Similarly, let |FN|_j correspond to the observed number of false negatives.

4. Calculate the aggregate number of true positives (|TP|) and false negatives (|FN|), and compute the sensitivity (deceptive recall) as:

   η = |TP| / (|TP| + |FN|).   (13)

5. Train a classifier using all reviews in D^train, and use it to classify reviews in D^dev.

6. Let the resulting number of true negative and false positive predictions in D^dev be given by |TN|^dev and |FP|^dev, respectively, and compute the specificity (truthful recall) as:

   θ = |TN|^dev / (|TN|^dev + |FP|^dev).   (14)

For the Bayesian Prevalence Model, we observe that the posterior distribution of a variable with an uninformative (uniform) Beta prior, after observing a successes and b failures, is just Beta(a + 1, b + 1), i.e., a and b are pseudo-counts. Based on this observation, we set the hyperparameters β and γ, corresponding to the classifier's sensitivity (deceptive recall) and specificity (truthful recall), respectively, to:

   β = ⟨|FN| + 1, |TP| + 1⟩,
   γ = ⟨|FP|^dev + 1, |TN|^dev + 1⟩.
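Putting Appendices A and B together, the collapsed sampler can be sketched in a few dozen lines. This is an unoptimized illustration, not the authors' implementation: the hyperparameter tuples follow the pseudo-count convention above (beta = (|FN|+1, |TP|+1), gamma = (|FP|dev+1, |TN|dev+1)), the defaults mirror the settings of Section 7, and a vectorized version would be needed at realistic scale. In practice, beta and gamma would be computed from the cross-validation and development-set counts of the procedure in Appendix B.

```python
import numpy as np

def gibbs_prevalence(f_x, alpha=(1.0, 1.0), beta=(1.0, 1.0), gamma=(1.0, 1.0),
                     n_iter=70_000, burn_in=20_000, lag=50, seed=0):
    """Collapsed Gibbs sampler for the Bayesian Prevalence Model.

    f_x   : array of observed classifier outputs (1 = predicted deceptive).
    alpha : (alpha_0, alpha_1) pseudo-counts for truthful / deceptive.
    beta  : (beta_0, beta_1) = (|FN|+1, |TP|+1) pseudo-counts for sensitivity.
    gamma : (gamma_0, gamma_1) = (|FP|dev+1, |TN|dev+1) pseudo-counts for specificity.
    Returns the mean of the Equation (9) estimate over retained samples.
    """
    rng = np.random.default_rng(seed)
    f_x = np.asarray(f_x)
    n = len(f_x)
    y = rng.integers(0, 2, size=n)  # random initialization of the latent labels

    # Sufficient statistics: X[k] = #{i : y_i = 1, f(x_i) = k},
    #                        Y[k] = #{i : y_i = 0, f(x_i) = k}.
    X = np.array([np.sum((y == 1) & (f_x == k)) for k in (0, 1)], dtype=float)
    Y = np.array([np.sum((y == 0) & (f_x == k)) for k in (0, 1)], dtype=float)

    samples = []
    for it in range(n_iter):
        for i in range(n):
            # Remove review i from the counts (the "(-i)" statistics).
            if y[i] == 1:
                X[f_x[i]] -= 1
            else:
                Y[f_x[i]] -= 1
            n1, n0 = X.sum(), Y.sum()

            # Unnormalized conditionals, Equations (7) and (8).
            p1 = (alpha[1] + n1) * (beta[f_x[i]] + X[f_x[i]]) / (sum(beta) + n1)
            p0 = (alpha[0] + n0) * (gamma[1 - f_x[i]] + Y[f_x[i]]) / (sum(gamma) + n0)

            y[i] = 1 if rng.random() < p1 / (p1 + p0) else 0
            if y[i] == 1:
                X[f_x[i]] += 1
            else:
                Y[f_x[i]] += 1

        if it >= burn_in and (it - burn_in) % lag == 0:
            # Equation (9): prevalence estimate from the current state.
            samples.append((alpha[1] + X.sum()) / (sum(alpha) + n))

    return float(np.mean(samples))
```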