Sandra Sukarieh
SPRAP: Detecting Opinion Spam
Campaigns in Online Rating Services
5 June 2020
Just a cool title? Or is something actually going wrong?
2
Amazon.de
I got it as a gift and I
loooooove it <3
This game is super with a
high quality!
I have never enjoyed a
game like this one!
Best game ever! I love the
pictures and the quality!
RECOMMENDED!!
3
More than 20% of Yelp’s reviews contain misleading content,
and one-third of all consumer reviews on the Internet are
estimated to be misleading [Rayana and Akoglu, 2015].
Spammers are becoming smarter at hiding themselves:
 A deceptive mix of legitimate reviews to build trust and fake reviews
to achieve their goals.
 Avoiding the well-known spam patterns.
Not just a cool title! Something’s INDEED going wrong!
Has anyone noticed that?
4
Fake Reviews
and Likes
• Liu et al., SPEC and SVM classification (EMNLP-CoNLL, 2007)
Suspicious
Users
• Rayana and Akoglu, SPEAGLE (KDD, 2015)
Collusion
Groups
• Dhawan et al., DeFrauder (IJCAI, 2019)
Another way to deal with that? Maybe more robust?
Characteristics that cannot be avoided:
 Relatively short period.
 Using the same account → co-reviewing.
(Figure: # co-reviewed products vs. log of # pairs co-reviewing n products.)
5
Another way to deal with that? Maybe more robust?
6
6 Jan 2020
8-9 Jan 2020
15-17 Dec 2019
Another way to deal with that? Maybe more robust?
7
6 Jan 2020
8-9 Jan 2020
15-17 Dec 2019
Detecting spam time intervals
in which spam campaigns
temporally take place
Detecting collusion spam
groups who perform those
spam campaigns
How to do it?
8
Spam behavior is rare and the majority of reviews are genuine.
Anomaly detection probabilistic model:
 ∃ p_R : x is spam ⇒ p_R(x) < some threshold
Detecting spam time intervals
in which spam campaigns
temporally take place
Detecting collusion spam
groups who perform those
spam campaigns
∃ p_T : t corresponds to a spam campaign ⇒ p_T(t) < μ
∃ p_G : g is a collusion spam group ⇒ p_G(g) < δ
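Both instantiations follow the same recipe: estimate a distribution over entities and flag any entity whose probability falls below a threshold. A minimal sketch of that thresholding idea over an empirical distribution (the function names and toy data are illustrative, not from the thesis):

```python
from collections import Counter

def empirical_prob(observations):
    """Estimate an empirical probability mass function from a sample."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

def flag_anomalies(observations, threshold):
    """Flag values whose empirical probability is below the threshold."""
    p = empirical_prob(observations)
    return sorted({x for x in observations if p[x] < threshold})

# Spam behavior is rare; the common value is genuine.
data = ["genuine"] * 98 + ["spam"] * 2
print(flag_anomalies(data, threshold=0.05))  # ['spam']
```

In SPRAP the entities are time intervals (threshold μ) and candidate groups (threshold δ) rather than plain labels.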
How to do it?
9
p_T and p_G
Spamicity indicators
Spamicity scores
Intervals Spamicity Score
10
Spamicity indicators:
 Members Count → Size s(t)
 Harmonious Rates → Density d(t)
 Quick Attacks → Weighted Width w(t)
 Big Deviation from the Target’s True Quality → Probability f(t)
 Multiple Targets → Pairs Score ψ_pairs(t)
Interval characteristics (size, density, width) → interval weight ψ(t)
averaged in one spamicity score spamicity(t)
Groups Spamicity Score
11
Spamicity indicators:
 Targeted Products → Targets Count f_g(g)
 Members Count → Size s(g)
 # Reviewed Products Common Between Members → Density d(g)
 # Reviewed Products NOT Common Between Members → Sparsity sp(g)
 Quick Attacks → Time Window tw(g)
 Co-reviewing Targets → Co-reviewing Ratio cr(g)
averaged in one spamicity score spamicity(g)
Simply averaging?
12
Is 100 kg equal to 100 cm? What would 100 kg + 100 cm even mean?
Indicators live on different scales, so we use the CDF to normalize values [Rayana and Akoglu, 2015].
Definition (Feature Normalization):
h = P(X ≤ h)     if a high value of h is suspicious
h = 1 − P(X ≤ h) if a low value of h is suspicious
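The definition above can be implemented directly with an empirical CDF; a small sketch (the sample data and function name are illustrative):

```python
def normalize(value, sample, high_is_suspicious=True):
    """Empirical-CDF feature normalization: map a raw indicator value to
    [0, 1] so that values closer to 1 are more suspicious."""
    cdf = sum(1 for x in sample if x <= value) / len(sample)  # P(X <= value)
    return cdf if high_is_suspicious else 1.0 - cdf

# "100 kg" and "100 cm" become comparable once both are CDF-normalized.
weights = [50, 60, 70, 80, 100]
heights = [100, 150, 160, 170, 180]
print(normalize(100, weights))                            # 1.0
print(normalize(100, heights, high_is_suspicious=False))  # 0.8
```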
Normalized Values Averaging
13
We normalize the intervals/groups spamicity indicators
using the Feature Normalization definition, for example:
 f(t) = 1 − P(X ≤ f(t))
 ψ_pairs(t) = P(X ≤ ψ_pairs(t))
spamicity(t) = (1/3) [ψ(t) + f(t) + ψ_pairs(t)]
spamicity(g) = (1/6) [f_g(g) + s(g) + d(g) + sp(g) + tw(g) + cr(g)]
What 𝑃?
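Once every indicator is normalized to [0, 1], the combination step is a plain average; a sketch with purely illustrative values:

```python
def spamicity(indicators):
    """Average already-normalized indicators (each in [0, 1]) into one
    spamicity score, as for spamicity(t) and spamicity(g)."""
    return sum(indicators) / len(indicators)

# Interval score from psi(t), f(t), psi_pairs(t); group score from six indicators.
interval_score = spamicity([0.9, 0.8, 0.7])
group_score = spamicity([0.9, 0.8, 0.7, 0.95, 0.6, 0.85])
print(round(interval_score, 3), round(group_score, 3))  # 0.8 0.8
```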
SPRAP – Outline
14
Reviews → User-Product Bipartite Graph
→ Extracted Intervals: t_0, t_1, t_2, …, t_{n−2}, t_{n−1}
→ Top Ranked Intervals → Initial Groups → Collusion Spamming Groups
→ Outputs: Spam reviews, Targeted Products, Individual Spammers
SPRAP – Top Ranked Intervals
15
Extracting intervals for each product q:
 Sliding window approach: width ∈ [1, |timeline(q)|]
 Huge and redundant space → width ∈ [1, τ]
Intervals with high spamicity score are reported:
 if spamicity(t) ≥ μ, then t is reported as a spam interval
What 𝑃?
P is the empirical distribution over extracted intervals:
 It contains all valid intervals.
 It contains intervals before further filtering.
 Added intervals are merged to get wider entities.
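The extraction-and-thresholding step can be sketched as follows (the interval indexing, day granularity, and spamicity callback are illustrative assumptions, not the thesis implementation):

```python
def extract_intervals(timeline_length, tau):
    """Enumerate candidate intervals (start, end) over a product's
    timeline, with width = end - start + 1 at most tau."""
    intervals = []
    for start in range(timeline_length):
        for width in range(1, tau + 1):
            end = start + width - 1
            if end < timeline_length:
                intervals.append((start, end))
    return intervals

def report_spam_intervals(intervals, spamicity, mu):
    """Report intervals whose spamicity score reaches the threshold mu."""
    return [t for t in intervals if spamicity(t) >= mu]

candidates = extract_intervals(timeline_length=5, tau=3)
print(len(candidates))  # 12 intervals of widths 1..3 over a 5-day timeline
```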
SPRAP – Collusion Spam Groups
16
Creating all possible groups is infeasible.
We are not only after cliques in the user co-reviewing graph,
so we cannot use Maximum Cliques or Maximal Frequent Itemset Mining (MFIM).
We are only considering “valid groups”:
(Figure: example user co-reviewing graph with users u_1, …, u_7.)
SPRAP – Collusion Spam Groups
17
Top Ranked Intervals → Initial Groups → Refined Groups → Collusion Spamming Groups
 Initial Groups: taken directly from the Top Ranked Intervals.
 Refined Groups: obtained after removing non-spammers.
 Collusion Spamming Groups: final reported groups after merging the refined groups (not necessarily cliques).
SPRAP – Collusion Spam Groups
18
6 Jan 2020
8-9 Jan 2020
15-17 Dec 2019
SPRAP – Collusion Spam Groups
19
P is the empirical distribution over valid groups, but:
 The set of created groups is very small.
 The majority of created groups is connected to spam campaigns.
Creating all valid groups is infeasible → Sampling!
Straightforward sampling can lead to a lot of rejections → MCMC!
What 𝑃?
A group is considered spam if spamicity(g) ≥ δ.
SPRAP – Collusion Spam Groups
20
Normalization
Schaeffer [2010] defined a balanced random walk that:
 reaches a Uniform stationary distribution.
 works on undirected, unweighted graphs.
p(v, w) = min(1/deg(v), 1/deg(w))                              if w ∈ neighbors(v)
p(v, w) = 1 − Σ_{w'∈neighbors(v)} min(1/deg(v), 1/deg(w'))     if w = v
p(v, w) = 0                                                    otherwise
However, our graph is the user co-reviewing graph and
we want to sample valid groups!
What 𝑃?
SPRAP – Collusion Spam Groups
21
Normalization
We define a Valid Groups Markov Chain:
 States are valid groups.
 We use the defined balanced random walk to sample valid groups.
 No need to build the whole chain before sampling.
 We add a random jump with a small probability ε.
What 𝑃?
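A sketch of one step of such a walk, combining the balanced transition probabilities with the ε-jump. In SPRAP the states are valid groups; here they are abstracted as opaque labels, and `eps` and the helper names are assumptions of this sketch:

```python
import random

def balanced_step(state, neighbors, all_states, eps=0.05, rng=random):
    """One step of a balanced random walk (Schaeffer-style) with an added
    eps-probability random jump. Moving v -> w happens with probability
    min(1/deg(v), 1/deg(w)); otherwise the walk stays at v, so the
    stationary distribution over reachable states is uniform."""
    if rng.random() < eps:
        return rng.choice(all_states)          # occasional uniform restart
    nbrs = neighbors[state]
    if not nbrs:
        return state
    w = rng.choice(nbrs)                       # propose with prob 1/deg(v)
    # Accept with prob min(1, deg(v)/deg(w)), giving min(1/deg v, 1/deg w) overall.
    if rng.random() < min(1.0, len(nbrs) / len(neighbors[w])):
        return w
    return state
```

Running many steps over a small chain visits all states roughly uniformly, which is exactly why the walk is usable for sampling without materializing the whole chain.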
SPRAP – Evaluation
22
Thresholds and Configurations
We estimate the best values of spamicity thresholds (𝜇, 𝛿)
by 5 repetitions of LOOCV.
We set the parameters as follows:
μ = 0.4, δ = 0.6, τ = 3
SPRAP – Evaluation
23
General Performance
Data Set | Intervals R/P | Reviews R/P | Targets R/P | Spammers R/P | Grouped Spammers R/P
A | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1
B | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1
C | 0.926 / 0.978 | 0.925 / 0.985 | 0.889 / 1 | 0.755 / 0.952 | 0.755 / 0.976
D | 0.991 / 0.946 | 1 / 0.962 | 1 / 0.92 | 1 / 0.914 | 1 / 0.955
E | 1 / 0.95 | 0.997 / 0.974 | 1 / 0.963 | 1 / 0.922 | 0.972 / 0.972
F | 0.986 / 0.939 | 0.994 / 0.969 | 1 / 0.895 | 0.989 / 0.869 | 0.989 / 0.989
G | 1 / 0.965 | 1 / 0.979 | 1 / 0.964 | 1 / 0.89 | 1 / 0.946
H | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 0.938 / 1
SPRAP – Evaluation
24
Wide Dense Campaigns – Effects of 𝜏
Generated Interval: [01-09-2019, 08-09-2019]
Reported Interval in I (after merging): [01-09-2019, 08-09-2019]
Candidate Intervals in T: [04-09, 04-09], [06-09, 06-09], [04-09, 06-09],
[03-09, 05-09], [05-09, 06-09], [03-09, 04-09], [06-09, 08-09],
[04-09, 05-09], [01-09, 03-09]
Details of detecting a time interval of width 8 in data set H.
SPRAP – Evaluation
25
Comparison to SPEAGLE [Rayana and Akoglu, 2015]
SPEAGLE reports spammers, fake reviews, and targets.
SPEAGLE depends heavily on textual characteristics →
we plant their labeled reviews in data set C, whose
spammers are pure spammers.
Algorithm | Reviews R/P | Spammers R/P | Targets R/P
SPRAP | 0.925 / 0.985 | 0.755 / 0.952 | 0.889 / 1
SPEAGLE | 1 / 0.196 | 1 / 0.118 | 1 / 0.07
Results of SPRAP with μ = 0.4, δ = 0.6, τ = 3 against the
best achieved recall and precision values for SPEAGLE.
SPRAP – Evaluation
26
Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019]
DeFrauder detects collusion spam groups.
We compare the two methods on data set D,
which has 6 planted collusion spam groups of a mixed
nature.
Algorithm | |C| | s_max(g) | Spammers R/P | Targets R/P
SPRAP | 9 | 26 | 1 / 0.955 | 1 / 0.92
DeFrauder | 126 | 5 | 0.709 / 0.329 | 1 / 0.383
Results of SPRAP with μ = 0.4, δ = 0.6, τ = 3 against DeFrauder.
SPRAP – Evaluation
27
Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019]
Group | All targets reviewed by all members | Reported as 1 group | Original in refined groups | Members | Reported Targets | FP Members
g_1 | Yes | Yes | 7 | 3/3 | 7/7 | 0
g_2 | No | Yes | 9 | 4/4 | 9/9 | 0
g_3 | No | No, as 2 | 3 | 5/5 | 3/3 | 1
g_4 | No | No, as 3 | 6 | 12/12 | 5/5 | 0
g_5 | No | Yes | 15 | 15/15 | 10/10 | 1
g_6 | No | Yes | 8 | 25/25 | 5/5 | 1
Reported collusion groups of SPRAP for data set D.
SPRAP – Evaluation
28
Amazon Software data
Amazon Software data set:
 Unlabeled.
 Has 341,931 reviews, 275,374 users, and 28,736 products.
Reported Entities | Spam Intervals | Spam Groups | Spammers | Fake Reviews | Targets
Count | |I| = 9606 | |C| = 3797 | |S| = 37883 | |Y| = 48043 | |Z| = 1066
Details | - | 35.5% non-cliques; 33374 members in total | 37883 with score ≥ 0.5 | - | -
Further notes:
 The longest reported time interval spans 71 days.
 The biggest reported collusion spam group has 1139 members.
Conclusion
29
Detecting spam campaigns is not trivial due to:
 Lack of ground truth.
 Huge overlap between spam and genuine behavior.
 Spammers evolving and altering their techniques.
Spamicity scores that depend on a set of indicators can be
a good approximation of the optimal distribution to detect
different spam entities.
We presented SPRAP:
 Detects different spam entities with very good accuracy.
 Starts from locating spam time intervals.
 Avoids easily broken assumptions.
What I did
Conclusion
30
Turning the solution into a full probabilistic anomaly
detection model.
Weighting the spamicity indicators differently to favor some
over the others (e.g., favoring groups with more targets).
Importance sampling of groups to include more “close-to-spam” groups.
What could be done
Thank you!
Special thanks to Prof. Vreeken, who gave me
the opportunity to be a part of the amazing
EDA group and supported me all along the way,
and to Janis for his valuable assistance and help
throughout the whole process.
I guess I have a Master’s degree now :D
References
 Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. Low-quality product review
detection in opinion summarization. Proceedings of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-
CoNLL), pages 334–342, 2007.
 Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and Shiqiang Yang. Catchsync: catching
synchronized behavior in large directed graphs. KDD ’14 Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 941–950, 2014.
 Bimal Viswanath, M. Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander
Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social
networks. Proceedings of the 23rd USENIX Security Symposium (USENIX Security), pages 223–238,
2014.
 Qiang Cao, Xiaowei Yang, Jieqi Yu, and Christopher Palow. Uncovering large groups of active
malicious accounts in online social networks. CCS ’14 Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications Security, pages 477–488, 2014.
 Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, and Christos Faloutsos.
Copycatch: stopping group attacks by spotting lockstep behavior in social networks. WWW ’13
Proceedings of the 22nd international conference on World Wide Web, pages 119–130, 2013.
References
 Zhen Xie and Sencun Zhu. Grouptie: toward hidden collusion group discovery in app stores. WiSec
’14 Proceedings of the 2014 ACM conference on Security and privacy in wireless and mobile
networks, pages 153–164, 2014.
 Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. Uncovering collusive spammers in chinese
review websites. CIKM ’13 Proceedings of the 22nd ACM international conference on Information &
Knowledge Management, pages 979–988, 2013.
 Shebuti Rayana and Leman Akoglu. Collective opinion spam detection: Bridging review networks
and metadata. KDD ’15 Proceedings of the 21st ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 985–994, 2015.
 Sarthika Dhawan, Siva Charan Reddy Gangireddy, Shiv Kumar, and Tanmoy Chakraborty. Spotting
collective behaviour of online frauds in customer reviews. IJCAI-19, pages 245–251, 2019.
 Satu Schaeffer. Scalable uniform graph sampling by local computation. SIAM J. Scientific
Computing, 32:2937–2963, 01 2010. doi: 10.1137/080716086.
Appendix A
37
Extracting Intervals
(Figure: timeline(q) over day_1 … day_14 with per-day counts of each rate 1-5;
example sliding windows of widths 1-4 capture an up-voting burst and a
down-voting burst.)
 width ∈ [1, |timeline(q)|]
 Huge and redundant space → width ∈ [1, τ]
Appendix B
Refining Groups
A group is considered spam if spamicity(g) ≥ δ.
Refining groups is done by removing the least-spammy user
in each iteration as long as the spamicity is increasing.
The least-spammy user is estimated based on:
intervals_ratio(u) = Σ_{t∈I} 1{u ∈ U_t} / Σ_{t∈T} 1{u ∈ U_t}
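The refinement loop can be sketched as a greedy procedure; the `spamicity` and `least_spammy` callbacks stand in for the thesis's scores (there, the least-spammy member is estimated via intervals_ratio), and the toy data is illustrative:

```python
def refine_group(group, spamicity, least_spammy):
    """Iteratively drop the least-spammy member while the group's
    spamicity score keeps increasing (greedy refinement sketch)."""
    group = list(group)
    while len(group) > 1:
        candidate = [u for u in group if u != least_spammy(group)]
        if spamicity(candidate) > spamicity(group):
            group = candidate
        else:
            break
    return group

# Toy score: fraction of members that belong to a known spammer set.
spammers = {"a", "b", "c"}
score = lambda g: sum(u in spammers for u in g) / len(g)
least = lambda g: min(g, key=lambda u: u in spammers)  # non-spammers first
print(refine_group(["a", "b", "c", "x", "y"], score, least))  # ['a', 'b', 'c']
```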
Appendix B
Reporting Groups
Merging refined groups is done iteratively as long as the
spamicity of the resulting group is preserved.
In each iteration we merge the pair with the highest
common-users ratio.
Reported Collusion spam groups are not necessarily cliques
in the user co-reviewing graph, unlike the initial and the
refined ones.
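A greedy sketch of this merging step, using Jaccard overlap as the common-users ratio (the exact ratio definition and the `spamicity` callback are assumptions of this sketch):

```python
def common_ratio(g1, g2):
    """Ratio of shared members between two groups (Jaccard overlap)."""
    g1, g2 = set(g1), set(g2)
    return len(g1 & g2) / len(g1 | g2)

def merge_groups(groups, spamicity, delta):
    """Greedily merge the pair of groups with the highest common-members
    ratio, as long as the merged group's spamicity stays >= delta."""
    groups = [set(g) for g in groups]
    while len(groups) > 1:
        pairs = [(common_ratio(a, b), i, j)
                 for i, a in enumerate(groups)
                 for j, b in enumerate(groups) if i < j]
        ratio, i, j = max(pairs)
        merged = groups[i] | groups[j]
        if ratio == 0 or spamicity(merged) < delta:
            break
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups

# Two overlapping groups merge; the disjoint one stays separate.
print(merge_groups([{"a", "b"}, {"b", "c"}, {"x"}], lambda g: 1.0, delta=0.6))
```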
Appendix C
Intervals Spamicity Threshold
Appendix C
Groups Spamicity Threshold
Appendix D
Sparse Campaigns – Effects of 𝜏
Appendix D
Sparse Campaigns – Effects of 𝜏
τ (Days) | Reviews R/P | Spammers R/P | Grouped Spammers R/P | Targets R/P
10 | 0.86 / 1 | 0.947 / 1 | 0.842 / 1 | 1 / 1
20 | 0.907 / 0.951 | 0.947 / 0.9 | 0.947 / 0.947 | 1 / 1
30 | 0.953 / 0.922 | 0.947 / 0.9 | 0.895 / 0.944 | 1 / 1
40 | 0.976 / 0.913 | 1 / 0.864 | 0.947 / 0.9 | 1 / 1
50 | 0.977 / 0.91 | 1 / 0.82 | 0.947 / 0.857 | 1 / 1
Results of increasing 𝜏 to catch temporal sparse campaigns.
Appendix E
Comparison to Baselines
Appendix F
Amazon Software data
Appendix F
Amazon Software data
The highest-ranked interval t_max:
 Spamicity score = 0.987.
 An up-voting campaign with 17 high rates over 2 days.
 Low probability f(t), since most of the target’s reviews have rates ∈ {1, 2, 3}.
The highest-ranked collusion group g_max:
 Spamicity score = 0.89.
 16 users giving 5-rate reviews to one target q during 2 days.
 The majority of members reviewed only q.
 The corresponding initial group has 27 members.
Appendix G
Calculation Formulas
Time interval density:
d(t) = Σ_{r(u,q)∈R_t} d'(rate(r))
d'(rate(r)) = α if rate(r) ∈ {1, 5}
d'(rate(r)) = β if rate(r) ∈ {2, 4}
d'(rate(r)) = γ if rate(r) = 3
Time interval weighted width:
w(t) = e^(−x), where x = width(t) − 1
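A small sketch of these two interval indicators; α, β, γ are left as parameters in the formulas above, so the concrete values here are purely illustrative:

```python
import math

def d_prime(rate, alpha=1.0, beta=0.5, gamma=0.1):
    """Per-review density contribution: extreme rates (1 or 5) weigh most.
    alpha/beta/gamma are illustrative defaults, not values from the thesis."""
    if rate in (1, 5):
        return alpha
    if rate in (2, 4):
        return beta
    return gamma  # rate == 3

def density(rates):
    """d(t): sum of d' over the rates of the reviews in the interval."""
    return sum(d_prime(r) for r in rates)

def weighted_width(width):
    """w(t) = exp(-(width - 1)): narrow bursts score higher."""
    return math.exp(-(width - 1))

print(density([5, 5, 1, 3]))  # 3.1
print(weighted_width(1))      # 1.0
```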
Appendix G
Calculation Formulas
Time interval probability:
f(t) = p(R_t | R_q) = (|U_t|! / (r_1! r_2! r_3! r_4! r_5!)) · p_1^r_1 · p_2^r_2 · p_3^r_3 · p_4^r_4 · p_5^r_5
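The multinomial probability can be computed with the standard library alone; a sketch (the rate distribution and counts are toy examples):

```python
from math import factorial

def interval_probability(rate_counts, rate_probs):
    """f(t): multinomial probability of the interval's rate histogram
    (r1..r5) under the product's overall rate distribution (p1..p5)."""
    n = sum(rate_counts)
    coef = factorial(n)
    for r in rate_counts:
        coef //= factorial(r)  # multinomial coefficient, exact integer math
    prob = 1.0
    for r, p in zip(rate_counts, rate_probs):
        prob *= p ** r
    return coef * prob

# 10 reviews, all 5-star, on a product whose rates are uniform:
print(interval_probability([0, 0, 0, 0, 10], [0.2] * 5))  # 0.2**10, ~1e-7
```

A burst of identical extreme rates gets a tiny probability under the product's normal rate distribution, which is exactly what makes it suspicious.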
Time interval pairs score:
ψ_pairs(t) = Σ_{q'∈Q_q} Σ_{t'∈T_q'} score(t, t')
score(t, t') = (|U_t ∩ U_t'| / |U_t ∪ U_t'|) · ψ(t')
Appendix G
Calculation Formulas
Group density:
d(g) = (1/|Q_g|) Σ_{j∈Q_g} 1{ Σ_{i∈U_g} 1{A[i,j] = 1} ≥ |U_g| · λ_d }
Group sparsity:
sp(g) = (1/|Q_g|) Σ_{j∈Q_g} 1{ Σ_{i∈U_g} 1{B[i,j] = 1} ≤ |U_g| · λ_sp }
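A sketch of the group density over a toy binary user-product matrix A (sparsity sp(g) mirrors it with matrix B and a ≤ λ_sp test); the λ_d default and the matrix are illustrative:

```python
def group_density(A, members, targets, lam_d=0.5):
    """d(g): fraction of targets reviewed by at least |U_g| * lam_d members.
    A[i][j] == 1 iff user i reviewed product j (toy binary matrix)."""
    need = len(members) * lam_d
    hit = sum(1 for j in targets
              if sum(A[i][j] for i in members) >= need)
    return hit / len(targets)

A = [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 1]]
print(group_density(A, members=[0, 1, 2], targets=[0, 1, 2]))  # 2/3
```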
Appendix G
Calculation Formulas
Group time window:
tw(g) = max_{q∈Q_g} gtw(g, q)
gtw(g, q) = 0 if span(g, q) > θ
gtw(g, q) = 1 − span(g, q)/θ otherwise
span(g, q) = max date(r) − min date(r) over all reviews r(u, q), u ∈ U_g
Group co-reviewing ratio:
cr(g) = (1 / C(|U_g|, 2)) Σ_{i=1}^{|U_g|} Σ_{j=i+1}^{|U_g|} 1{(u_i, u_j) ∈ E}
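A sketch of the co-reviewing ratio: the fraction of member pairs that form edges in the co-reviewing graph (members and edges here are illustrative):

```python
from itertools import combinations

def co_reviewing_ratio(members, edges):
    """cr(g): fraction of member pairs connected in the co-reviewing graph."""
    edges = {frozenset(e) for e in edges}      # treat edges as undirected
    pairs = list(combinations(members, 2))
    hits = sum(1 for p in pairs if frozenset(p) in edges)
    return hits / len(pairs)

# 4 members; 4 of the 6 possible co-reviewing edges are present.
print(co_reviewing_ratio(["u1", "u2", "u3", "u4"],
                         [("u1", "u2"), ("u1", "u3"),
                          ("u2", "u3"), ("u3", "u4")]))  # 4/6
```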
Appendix G
Calculation Formulas
Spammer score:
spamicity(u) = (intervals_ratio(u) + 1) / 2 if u belongs to a reported group g
spamicity(u) = intervals_ratio(u) otherwise

More Related Content

Similar to SPRAP - Master Thesis Defense

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Simplilearn
 

Similar to SPRAP - Master Thesis Defense (20)

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Proto-Design Your Future - Capital One Digital for Good Summit
Proto-Design Your Future - Capital One Digital for Good SummitProto-Design Your Future - Capital One Digital for Good Summit
Proto-Design Your Future - Capital One Digital for Good Summit
 
Measurement and scaling noncomparative scaling technique
Measurement and scaling noncomparative scaling techniqueMeasurement and scaling noncomparative scaling technique
Measurement and scaling noncomparative scaling technique
 
chap 9.pptx
chap 9.pptxchap 9.pptx
chap 9.pptx
 
Marketing research ch 9_malhotra
Marketing research ch 9_malhotraMarketing research ch 9_malhotra
Marketing research ch 9_malhotra
 
Big Data Privacy Standard Requirements
Big Data Privacy Standard RequirementsBig Data Privacy Standard Requirements
Big Data Privacy Standard Requirements
 
Scor model
Scor modelScor model
Scor model
 
chap 9.pdf
chap 9.pdfchap 9.pdf
chap 9.pdf
 
IRJET-Fake Product Review Monitoring
IRJET-Fake Product Review MonitoringIRJET-Fake Product Review Monitoring
IRJET-Fake Product Review Monitoring
 
Raji Balasuubramaniyan, Senior Data Scientist, Manheim at MLconf ATL - 9/18/15
Raji Balasuubramaniyan, Senior Data Scientist, Manheim at MLconf ATL - 9/18/15Raji Balasuubramaniyan, Senior Data Scientist, Manheim at MLconf ATL - 9/18/15
Raji Balasuubramaniyan, Senior Data Scientist, Manheim at MLconf ATL - 9/18/15
 
A/B Testing: Common Pitfalls and How to Avoid Them
A/B Testing: Common Pitfalls and How to Avoid ThemA/B Testing: Common Pitfalls and How to Avoid Them
A/B Testing: Common Pitfalls and How to Avoid Them
 
Figuring out the right metrics for your game
Figuring out the right metrics for your gameFiguring out the right metrics for your game
Figuring out the right metrics for your game
 
Akshit gupta management_science
Akshit gupta management_scienceAkshit gupta management_science
Akshit gupta management_science
 
Six sigma
Six sigmaSix sigma
Six sigma
 
UK GIAF: Winter 2015
UK GIAF: Winter 2015UK GIAF: Winter 2015
UK GIAF: Winter 2015
 
SIG-NOC Tools Survey 2019 Results
SIG-NOC Tools Survey 2019 ResultsSIG-NOC Tools Survey 2019 Results
SIG-NOC Tools Survey 2019 Results
 
Decisions
DecisionsDecisions
Decisions
 
Dilshod Achilov Gage R&R
Dilshod Achilov Gage R&RDilshod Achilov Gage R&R
Dilshod Achilov Gage R&R
 
file000243.pdf
file000243.pdffile000243.pdf
file000243.pdf
 
Process improvement 070617
Process improvement 070617Process improvement 070617
Process improvement 070617
 

More from sandra sukarieh

More from sandra sukarieh (8)

Schema learning
Schema learningSchema learning
Schema learning
 
Strong stubborn sets
Strong stubborn setsStrong stubborn sets
Strong stubborn sets
 
Cloud Computing Interoperability in Education
Cloud Computing Interoperability in EducationCloud Computing Interoperability in Education
Cloud Computing Interoperability in Education
 
Applications of Distributed Systems
Applications of Distributed SystemsApplications of Distributed Systems
Applications of Distributed Systems
 
Storyboarding - Information Systems Engineering
Storyboarding - Information Systems EngineeringStoryboarding - Information Systems Engineering
Storyboarding - Information Systems Engineering
 
Timed Colored Perti Nets
Timed Colored Perti NetsTimed Colored Perti Nets
Timed Colored Perti Nets
 
Web Server - Internet Applications
Web Server - Internet ApplicationsWeb Server - Internet Applications
Web Server - Internet Applications
 
Database Threats - Information System Security
Database Threats - Information System SecurityDatabase Threats - Information System Security
Database Threats - Information System Security
 

Recently uploaded

IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff
17thcssbs2
 
The basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptxThe basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

NCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdfNCERT Solutions Power Sharing Class 10 Notes pdf
NCERT Solutions Power Sharing Class 10 Notes pdf
 
Advances in production technology of Grapes.pdf
Advances in production technology of Grapes.pdfAdvances in production technology of Grapes.pdf
Advances in production technology of Grapes.pdf
 
IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
Salient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptxSalient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptx
 
The basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptxThe basics of sentences session 4pptx.pptx
The basics of sentences session 4pptx.pptx
 
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptxMatatag-Curriculum and the 21st Century Skills Presentation.pptx
Matatag-Curriculum and the 21st Century Skills Presentation.pptx
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
Application of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matricesApplication of Matrices in real life. Presentation on application of matrices
Application of Matrices in real life. Presentation on application of matrices
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & EngineeringBasic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
Basic Civil Engg Notes_Chapter-6_Environment Pollution & Engineering
 
[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation[GDSC YCCE] Build with AI Online Presentation
[GDSC YCCE] Build with AI Online Presentation
 
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdfINU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
INU_CAPSTONEDESIGN_비밀번호486_업로드용 발표자료.pdf
 
Morse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptxMorse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptx
 
Open Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPointOpen Educational Resources Primer PowerPoint
Open Educational Resources Primer PowerPoint
 
Basic_QTL_Marker-assisted_Selection_Sourabh.ppt
Basic_QTL_Marker-assisted_Selection_Sourabh.pptBasic_QTL_Marker-assisted_Selection_Sourabh.ppt
Basic_QTL_Marker-assisted_Selection_Sourabh.ppt
 
Gyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptxGyanartha SciBizTech Quiz slideshare.pptx
Gyanartha SciBizTech Quiz slideshare.pptx
 

SPRAP - Master Thesis Defense

  • 1. Sandra Sukarieh SPRAP: Detecting Opinion Spam Campaigns in Online Rating Services 5 June 2020
  • 2. Just a cool title? or something’s actually going wrong? 2Amazon.de I got it as a gift and I loooooove it <3 This game is super with a high quality! I have never enjoyed a game like this one! Best game ever! I love the pictures and the quality! RECOMMENDED!!
  • 3. 3 More than 20% of Yelp’s reviews are of misleading content and one-third of all consumer reviews on the Internet are estimated to be misleading [Rayana and Akoglu, 2015]. Spammers are becoming smarter in hiding themselves.  Deceptive mix of legitimate reviews to build trust and fake reviews to achieve the tasks.  Avoid the well-known spam patterns. Not just a cool title! Something’s INDEED going wrong!
  • 4. Has anyone noticed that? 4 Fake Reviews and Likes • Liu et al., SPEC and SVM classification (EMNLP- CoNLL, 2007) Suspicious Users • Rayana and Akoglu, SPEAGLE (KDD, 2015) Collusion Groups • Dhawan et al., DeFrauder (IJCAI, 2019)
  • 5. Another way to deal with that? Maybe more robust? characteristics that cannot be avoided Relatively short period Using the same account  co-reviewing # co-reviewed products logof#pairsco-reviewed𝑛products 5
  • 6. Another way to deal with that? Maybe more robust? 6 6 Jan 2020 8-9 Jan 2020 15-17 Dec 2019
  • 7. Another way to deal with that? Maybe more robust? 7 6 Jan 2020 8-9 Jan 2020 15-17 Dec 2020 Detecting spam time intervals in which spam campaigns temporally take place Detecting collusion spam groups who perform those spam campaigns
  • 8. How to do it? 8 Spam behavior is rare and the majority are genuine Anomaly detection probabilistic model:  ∃𝑝 𝑟: 𝑥 𝑖𝑠 𝑠𝑝𝑎𝑚 ⇒ 𝑝 𝑟 𝑥 < 𝑠𝑜𝑚𝑒 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 Detecting spam time intervals in which spam campaigns temporally take place Detecting collusion spam groups who perform those spam campaigns ∃𝑝 𝑇: 𝑡 𝑐𝑜𝑟𝑟𝑒𝑠𝑝𝑜𝑛𝑑𝑠 𝑡𝑜 𝑎 𝑠𝑝𝑎𝑚 𝑐𝑎𝑚𝑝𝑎𝑖𝑔𝑛 ⇒ 𝑝 𝑇 𝑡 < 𝜇 ∃𝑝 𝐺: 𝑔 𝑖𝑠 𝑎 𝑐𝑜𝑙𝑙𝑢𝑠𝑖𝑜𝑛 𝑠𝑝𝑎𝑚 𝑔𝑟𝑜𝑢𝑝 ⇒ 𝑝 𝐺 𝑔 < 𝛿
  • 9. How to do it? 9 𝑝 𝑇 𝑝 𝐺 Spamicity indicators Spamicity scores
  • 10. Intervals Spamicity Score 10 Spamicity indicators Members Count Harmonious Rates Quick Attacks Big Deviation from the Target’s True Quality Multiple Targets Interval characteristics  interval weight ψ 𝑡 Size s(𝑡) Density 𝑑(𝑡) Weighted Width w(𝑡) Probability f(𝑡) Pairs Score ψ 𝑝𝑎𝑖𝑟𝑠 𝑡 averaged in one spamicity score s𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 𝑡
  • 11. Groups Spamicity Score 11 Spamicity indicators Targeted Products Members Count # Reviewed Products NOT Common Between Members Quick Attacks Co-reviewing Targets Targets Count 𝑓𝑔(𝑔) Size s(𝑔) Sparsity 𝑠𝑝(𝑔) Time Window 𝑡𝑤(𝑔) Co-reviewing Ratio 𝑐𝑟(𝑔) averaged in one spamicity score s𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 𝑔 # Reviewed Products Common Between Members Density 𝑑(𝑔)
  • 12. Simply averaging? 12 100 𝑘𝑔 = 100 𝑐𝑚 100 𝑘𝑔 + 100 𝑐𝑚 CDF to normalize values [Rayana and Akoglu, 2015]. Definition: Feature Normalization: ℎ = 𝑃 𝑋 ≤ ℎ 𝑖𝑓 ℎ𝑖𝑔ℎ 𝑣𝑎𝑙𝑢𝑒 𝑜𝑓 ℎ 𝑖𝑠 𝑠𝑢𝑠𝑝𝑖𝑐𝑖𝑜𝑢𝑠 1 − 𝑃 𝑋 ≤ ℎ 𝑖𝑓 𝑙𝑜𝑤 𝑣𝑎𝑙𝑢𝑒 𝑜𝑓 ℎ 𝑖𝑠 𝑠𝑢𝑠𝑝𝑖𝑐𝑖𝑜𝑢𝑠
  • 13. Normalized Values Averaging 13 We normalize the intervals/groups spamicity indicators using the Feature Normalization definition, for example:  𝑓 𝑡 = 1 − 𝑃(𝑋 ≤ 𝑓 𝑡 )  ψ 𝑝𝑎𝑖𝑟𝑠 𝑡 = 𝑃(𝑋 ≤ ψ 𝑝𝑎𝑖𝑟𝑠 𝑡 ) 𝑠𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 t = 1 3 [ψ 𝑡 + 𝑓 𝑡 + ψ 𝑝𝑎𝑖𝑟𝑠 𝑡 ] 𝑠𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 g = 1 6 [𝑓𝑔 𝑔 + 𝑠 𝑔 + 𝑑 𝑔 + 𝑠𝑝 𝑔 + 𝑡𝑤 𝑔 + 𝑐𝑟 𝑔 ] What 𝑃?
  • 14. SPRAP - Outline 14 User-ProductBipartiteGraphReview 𝒕 𝟐 𝒕 𝟎 𝒕 𝟏 𝒕 𝒏−𝟐 𝒕 𝒏−𝟏 Extracted Intervals Top Ranked Intervals Initial Groups CollusionSpammingGroups Spam reviews Targeted Products Individual Spammers
  • 15. SPRAP – Top Ranked Intervals 15 Extracting intervals for each product 𝑞:  Sliding window approach: 𝑤𝑖𝑑𝑡ℎ ∈ [1, |𝑡𝑖𝑚𝑒𝑙𝑖𝑛𝑒 𝑞|]  Huge and redundant space  𝑤𝑖𝑑𝑡ℎ ∈ [1, 𝜏] Intervals with high spamicity score are reported:  𝑖𝑓 𝑠𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 𝑡 ≥ 𝜇 ⇒ 𝑡 𝑖𝑠 𝑟𝑒𝑝𝑜𝑟𝑡𝑒𝑑 𝑎𝑠 𝑠𝑝𝑎𝑚 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙 What 𝑃? 𝑃 is the intervals empirical distribution:  Contains all valid intervals.  Contains intervals before further filtering.  Added intervals are merged to get wider entities.
  • 16. SPRAP – Collusion Spam Groups 16 Creating all possible groups is infeasible. We are not only after cliques in the user co-reviewing graph, so we cannot use Maximum Cliques or MFIM. We are only considering “valid groups”: 𝑢1 𝑢2 𝑢6 𝑢3 𝑢4 𝑢5 𝑢7
  • 17. SPRAP – Collusion Spam Groups 17 Top Ranked Intervals Initial Groups CollusionSpammingGroups Refined Groups Groups taken directly from Top Ranked Intervals Groups after removing non-spammers Final reported groups after merging the refined groups (not necessarily cliques)
  • 18. SPRAP – Collusion Spam Groups 18 6 Jan 2020 8-9 Jan 2020 15-17 Dec 2020
  • 19. SPRAP – Collusion Spam Groups 19 𝑃 is the valid groups empirical distribution, but:  The set of created groups is very small.  The majority of created groups is connected to spam campaigns. Creating all valid groups is infeasible  Sampling! Straight-forward sampling can lead to a lot of rejections  MCMC! What 𝑃? A Group is considered spam if 𝑠𝑝𝑎𝑚𝑖𝑐𝑖𝑡𝑦 𝑔 ≥ δ.
  • 20. SPRAP – Collusion Spam Groups 20 Normalization Schaeffer [2010] dealt with a balanced random walk:  reaches a Uniform stationary distribution.  undirected, unweighted graphs. 𝑝 𝑣,𝑤 = min 1 deg 𝑣 , 1 deg 𝑤 𝑖𝑓 𝑤 ∈ 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠(𝑣) 1 − 𝑤∈𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠 𝑣 min 1 deg 𝑣 , 1 deg 𝑤 𝑖𝑓 𝑤 = 𝑣 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 However, our graph is the user co-reviewing graph and we want to sample valid groups! What 𝑃?
  • 21. SPRAP – Collusion Spam Groups 21 Normalization We define a Valid Groups Markov Chain :  States are valid groups.  We use the defined balanced random walk to sample valid groups.  No need to build the whole chain before sampling.  We add a random jump with a small probability 𝜖. What 𝑃?
  • 22. SPRAP – Evaluation 22 Thresholds and Configurations We estimate the best values of spamicity thresholds (𝜇, 𝛿) by 5 repetitions of LOOCV. We set the parameters as follows: 𝜇 = 0.4 𝛿 = 0.6 𝜏 = 3
  • 23. SPRAP – Evaluation 23 General Performance Data Set Intervals Reviews Targets Spammers Grouped Spammers R P R P R P R P R P A 1 1 1 1 1 1 1 1 1 1 B 1 1 1 1 1 1 1 1 1 1 C 0.926 0.978 0.925 0.985 0.889 1 0.755 0.952 0.755 0.976 D 0.991 0.946 1 0.962 1 0.92 1 0.914 1 0.955 E 1 0.95 0.997 0.974 1 0.963 1 0.922 0.972 0.972 F 0.986 0.939 0.994 0.969 1 0.895 0.989 0.869 0.989 0.989 G 1 0.965 1 0.979 1 0.964 1 0.89 1 0.946 H 1 1 1 1 1 1 1 1 0.938 1
  • 27. SPRAP – Evaluation 24 Wide Dense Campaigns – Effects of 𝜏
Generated interval: 01-09-2019 to 08-09-2019
Interval in 𝑻 | Interval in 𝑰
04-09-2019 to 04-09-2019 | 01-09-2019 to 08-09-2019
06-09-2019 to 06-09-2019 | 04-09-2019 to 06-09-2019
03-09-2019 to 05-09-2019 | 05-09-2019 to 06-09-2019
03-09-2019 to 04-09-2019 | 06-09-2019 to 08-09-2019
04-09-2019 to 05-09-2019 | 01-09-2019 to 03-09-2019
Details of detecting a time interval of width 8 in data set H.
  • 28. SPRAP – Evaluation 25 Comparison to SPEAGLE [Rayana and Akoglu, 2015] SPEAGLE reports spammers, fake reviews, and targets. SPEAGLE depends heavily on textual characteristics, so we plant its labeled reviews in data set C, whose spammers are pure spammers.
Algorithm | Reviews R/P | Spammers R/P | Targets R/P
SPRAP | 0.925 / 0.985 | 0.755 / 0.952 | 0.889 / 1
SPEAGLE | 1 / 0.196 | 1 / 0.118 | 1 / 0.07
Results of SPRAP with 𝜇 = 0.4, 𝛿 = 0.6, 𝜏 = 3 against the best achieved recall and precision values for SPEAGLE.
  • 29. SPRAP – Evaluation 26 Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019] DeFrauder detects collusion spam groups. We compare the two methods on data set D, which has 6 planted collusion spam groups of a mixed nature.
Algorithm | |𝑪| | 𝒔_𝒎𝒂𝒙(𝒈) | Spammers R/P | Targets R/P
SPRAP | 9 | 26 | 1 / 0.955 | 1 / 0.92
DeFrauder | 126 | 5 | 0.709 / 0.329 | 1 / 0.383
Results of SPRAP with 𝜇 = 0.4, 𝛿 = 0.6, 𝜏 = 3 against DeFrauder.
  • 30. SPRAP – Evaluation 27 Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019]
Group | All targets reviewed by all members | Reported as 1 group | Members | Reported targets | Reported members | FP members
𝑔1 | Yes | Yes | 7 | 3/3 | 7/7 | 0
𝑔2 | No | Yes | 9 | 4/4 | 9/9 | 0
𝑔3 | No | No, as 2 | 3 | 5/5 | 3/3 | 1
𝑔4 | No | No, as 3 | 6 | 12/12 | 5/5 | 0
𝑔5 | No | Yes | 15 | 15/15 | 10/10 | 1
𝑔6 | No | Yes | 8 | 25/25 | 5/5 | 1
Reported collusion groups of SPRAP for data set D.
  • 31. SPRAP – Evaluation 28 Amazon Software data Amazon Software data set:  Unlabeled.  Has 341,931 reviews, 275,374 users, and 28,736 products.
Reported entities:
 Spam intervals: |𝐼| = 9606.
 Spam groups: |𝐶| = 3797, of which 35.5% are non-cliques, with 33,374 members in total.
 Spammers: |𝑆| = 37,883, all with score ≥ 0.5.
 Fake reviews: |𝑌| = 48,043.
 Targets: |𝑍| = 1066.
Further notes:  Longest reported time interval spans 71 days.  Biggest reported collusion spam group has 1139 members.
  • 32. Conclusion 29 Detecting spam campaigns is not trivial due to:  Lack of ground truth.  Huge overlap between spam and genuine behavior.  Evolution of spammers and altering of their techniques. Spamicity scores that depend on a set of indicators can be a good approximation of the optimal distribution to detect different spam entities. We presented SPRAP:  Detects different spam entities with very good accuracy.  Starts from locating spam time intervals.  Avoids easily broken assumptions. What I did
  • 33. Conclusion 30 Turning the solution into a full probabilistic anomaly detection model. Weighting the spamicity indicators differently to favor some over others (e.g., favoring groups with more targets). Importance sampling of groups to include more “close-to-spam” groups. What could be done
  • 34. Thank you! Special thanks to Prof. Vreeken, who gave me the opportunity to be a part of the amazing EDA group and supported me all along the way, and to Janis for his valuable assistance throughout the whole process. I guess I have a Master’s degree now :D
  • 35. References  Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. Low-quality product review detection in opinion summarization. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 334–342, 2007.  Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and Shiqiang Yang. CatchSync: catching synchronized behavior in large directed graphs. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pages 941–950, 2014.  Bimal Viswanath, M. Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. Proceedings of the 23rd USENIX Security Symposium (USENIX Security), pages 223–238, 2014.  Qiang Cao, Xiaowei Yang, Jieqi Yu, and Christopher Palow. Uncovering large groups of active malicious accounts in online social networks. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14), pages 477–488, 2014.  Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, and Christos Faloutsos. CopyCatch: stopping group attacks by spotting lockstep behavior in social networks. Proceedings of the 22nd International Conference on World Wide Web (WWW '13), pages 119–130, 2013.
  • 36. References  Zhen Xie and Sencun Zhu. GroupTie: toward hidden collusion group discovery in app stores. Proceedings of the 2014 ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec '14), pages 153–164, 2014.  Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. Uncovering collusive spammers in Chinese review websites. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13), pages 979–988, 2013.  Shebuti Rayana and Leman Akoglu. Collective opinion spam detection: Bridging review networks and metadata. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), pages 985–994, 2015.  Sarthika Dhawan, Siva Charan Reddy Gangireddy, Shiv Kumar, and Tanmoy Chakraborty. Spotting collective behaviour of online frauds in customer reviews. Proceedings of IJCAI-19, pages 245–251, 2019.  Satu Schaeffer. Scalable uniform graph sampling by local computation. SIAM Journal on Scientific Computing, 32:2937–2963, 2010. doi: 10.1137/080716086.
  • 37. Appendix A 37 Extracting Intervals
[Figure: a product's review timeline timeline_q over day_1 … day_14 with the daily review counts, marking example up-voting windows of width 1 and 4 and down-voting windows of width 2 and 3.]
 width ∈ [1, |timeline_q|]  huge and redundant search space.
 We therefore restrict width ∈ [1, 𝜏].
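The restricted enumeration can be sketched as follows. This is an illustrative sketch, not the SPRAP code; `candidate_intervals` and the example daily counts are assumptions for the demonstration.

```python
def candidate_intervals(timeline, tau):
    """Yield (start_day, end_day) pairs with width = end - start + 1 <= tau,
    instead of allowing widths up to len(timeline)."""
    n = len(timeline)
    for start in range(n):
        for width in range(1, min(tau, n - start) + 1):
            yield (start, start + width - 1)

# Daily review counts for one product, as in the timeline sketch above.
counts = [5, 3, 0, 2, 4, 3, 0, 2, 3, 6, 2, 1, 2, 0]
intervals = list(candidate_intervals(counts, tau=3))
# Every feasible start day contributes widths 1..tau, nothing wider.
assert all(end - start + 1 <= 3 for start, end in intervals)
```

With 𝜏 = 3 this yields 39 candidate windows over 14 days, instead of the 105 windows of the unrestricted enumeration.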
  • 38. Appendix B Refining Groups A group 𝑔 is considered spam if spamicity(𝑔) ≥ δ. Refining groups is done by removing the least-spammy user in each iteration as long as the spamicity is increasing. The least-spammy user is estimated based on:
intervals_ratio(𝑢) = Σ_{𝑡 ∈ 𝐼} 𝟙{𝑢 ∈ 𝑈𝑡} / Σ_{𝑡 ∈ 𝑇} 𝟙{𝑢 ∈ 𝑈𝑡}
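The refinement loop can be sketched as below. This is a minimal sketch with hypothetical stand-ins for the group spamicity function and the per-user intervals ratio; none of these names come from the SPRAP implementation.

```python
def refine(group, spamicity, intervals_ratio, delta=0.6):
    """Drop the least-spammy member while group spamicity keeps increasing;
    report the group only if it ends up above the threshold delta."""
    group = set(group)
    while len(group) > 2:
        least = min(group, key=intervals_ratio)      # least-spammy member
        candidate = group - {least}
        if spamicity(candidate) > spamicity(group):  # still increasing?
            group = candidate
        else:
            break
    return group if spamicity(group) >= delta else None

# Toy scoring: a group's spamicity is its members' mean intervals ratio.
ratios = {'a': 0.9, 'b': 0.8, 'c': 0.1}
mean_ratio = lambda g: sum(ratios[u] for u in g) / len(g)
refined = refine({'a', 'b', 'c'}, mean_ratio, ratios.get)
assert refined == {'a', 'b'}   # dropping 'c' lifts the score from 0.6 to 0.85
```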
  • 39. Appendix B Reporting Groups Merging refined groups is done iteratively as long as the spamicity of the resulting group is preserved. In each iteration we merge the pair with the highest common-users ratio. Reported collusion spam groups are not necessarily cliques in the user co-reviewing graph, unlike the initial and the refined ones.
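One merge iteration can be sketched as follows, assuming the common-users ratio is a Jaccard ratio over group members; `merge_step` and the toy spamicity function are illustrative, not the SPRAP API.

```python
from itertools import combinations

def merge_step(groups, spamicity, delta=0.6):
    """Merge the pair of groups with the highest common-users ratio,
    but only if the merged group's spamicity stays above delta."""
    def common_ratio(a, b):
        return len(a & b) / len(a | b)
    i, j = max(combinations(range(len(groups)), 2),
               key=lambda p: common_ratio(groups[p[0]], groups[p[1]]))
    merged = groups[i] | groups[j]
    if spamicity(merged) >= delta:               # spamicity preserved
        return [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups                                # no admissible merge

groups = [{'a', 'b'}, {'b', 'c'}, {'x', 'y'}]
merged = merge_step(groups, lambda g: 1.0)       # toy spamicity: always 1
assert {'a', 'b', 'c'} in merged and len(merged) == 2
```

Note that the merged group {'a', 'b', 'c'} need not be a clique in the co-reviewing graph even when the two refined groups were.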
  • 42. Appendix D Sparse Campaigns – Effects of 𝜏
  • 43. Appendix D Sparse Campaigns – Effects of 𝜏
𝜏 (Days) | Reviews R/P | Spammers R/P | Grouped Spammers R/P | Targets R/P
10 | 0.86 / 1 | 0.947 / 1 | 0.842 / 1 | 1 / 1
20 | 0.907 / 0.951 | 0.947 / 0.9 | 0.947 / 0.947 | 1 / 1
30 | 0.953 / 0.922 | 0.947 / 0.9 | 0.895 / 0.944 | 1 / 1
40 | 0.976 / 0.913 | 1 / 0.864 | 0.947 / 0.9 | 1 / 1
50 | 0.977 / 0.91 | 1 / 0.82 | 0.947 / 0.857 | 1 / 1
Results of increasing 𝜏 to catch temporally sparse campaigns.
  • 46. Appendix F Amazon Software data The highest-ranked interval 𝑡_max:  Spamicity score = 0.987.  Up-voting campaign with 17 high rates over 2 days.  Low probability, since the target has a lot of reviews ∈ {1, 2, 3}. The highest-ranked collusion group 𝑔_max:  Spamicity score = 0.89.  16 users giving 5-rate reviews to one target 𝑞 during 2 days.  Majority of members only reviewed 𝑞.  Corresponding initial group has 27 members.
  • 47. Appendix G Calculation Formulas
Time interval density:
𝑑(𝑡) = Σ_{𝑟(𝑢,𝑞) ∈ 𝑅𝑡} 𝑑′(rate(𝑟))
𝑑′(rate(𝑟)) = α if rate(𝑟) ∈ {1, 5}; β if rate(𝑟) ∈ {2, 4}; γ if rate(𝑟) = 3
Time interval weighted width:
𝑤(𝑡) = 𝑒^(−𝑥), where 𝑥 = width(𝑡) − 1
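The two indicators can be computed as below. This is an illustrative sketch: the concrete weights ALPHA > BETA > GAMMA are assumptions (extreme ratings {1, 5} count most, mild ones {2, 4} less, the neutral 3 least), not the values used by SPRAP.

```python
import math

ALPHA, BETA, GAMMA = 1.0, 0.5, 0.25   # assumed weights, not SPRAP's

def rating_weight(rate):
    if rate in (1, 5):
        return ALPHA
    if rate in (2, 4):
        return BETA
    return GAMMA                      # rate == 3

def density(rates):
    """d(t): weighted count of the ratings falling in the interval."""
    return sum(rating_weight(r) for r in rates)

def weighted_width(width):
    """w(t) = e^-(width - 1): 1 for one-day bursts, decaying with width."""
    return math.exp(-(width - 1))

# A burst of extreme ratings scores higher than a run of neutral ones.
assert density([5, 5, 1]) > density([3, 3, 3])
assert weighted_width(1) == 1.0
```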
  • 48. Appendix G Calculation Formulas
Time interval probability:
𝑓(𝑡) = 𝑝(𝑅𝑡 | 𝑅𝑞) = (|𝑈𝑡|! / (𝑟1! 𝑟2! 𝑟3! 𝑟4! 𝑟5!)) · 𝑝1^𝑟1 𝑝2^𝑟2 𝑝3^𝑟3 𝑝4^𝑟4 𝑝5^𝑟5
Time interval pairs score:
ψ_pairs(𝑡) = Σ_{𝑞′ ∈ 𝑄𝑞} Σ_{𝑡′ ∈ 𝑇𝑞′} score(𝑡, 𝑡′)
score(𝑡, 𝑡′) = (|𝑈𝑡 ∩ 𝑈𝑡′| / |𝑈𝑡 ∪ 𝑈𝑡′|) · ψ(𝑡′)
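The multinomial probability 𝑓(𝑡) is a direct computation. A minimal sketch (the function name is illustrative): given the interval's rating counts 𝑟1..𝑟5 and the product's overall rating distribution 𝑝1..𝑝5, a very small value signals an anomalous interval.

```python
from math import factorial, prod

def interval_probability(counts, probs):
    """Multinomial probability of the rating counts under the product's
    overall rating distribution."""
    n = sum(counts)
    coef = factorial(n)
    for r in counts:
        coef //= factorial(r)          # multinomial coefficient
    return coef * prod(p ** r for p, r in zip(probs, counts))

# Ten 5-star reviews on a product with a uniform rating history: 0.2^10,
# i.e. roughly one in ten million -- a strong anomaly signal.
p = interval_probability([0, 0, 0, 0, 10], [0.2] * 5)
assert abs(p - 0.2 ** 10) < 1e-15
```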
  • 49. Appendix G Calculation Formulas
Group density:
𝑑(𝑔) = (1 / |𝑄𝑔|) Σ_{𝑗 ∈ 𝑄𝑔} 𝟙{ Σ_{𝑖 ∈ 𝑈𝑔} 𝟙{𝐴[𝑖, 𝑗] = 1} ≥ |𝑈𝑔| · λ_𝑑 }
Group sparsity:
𝑠𝑝(𝑔) = (1 / |𝑄𝑔|) Σ_{𝑗 ∈ 𝑄𝑔} 𝟙{ Σ_{𝑖 ∈ 𝑈𝑔} 𝟙{𝐵[𝑖, 𝑗] = 1} ≤ |𝑈𝑔| · λ_𝑠𝑝 }
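Both indicators can be evaluated over binary user-product matrices. This sketch simplifies what 𝐴 and 𝐵 encode to plain "user 𝑖 reviewed product 𝑗" indicators; the function names and the nested-dict representation are assumptions.

```python
def group_density(A, users, products, lam_d):
    """Share of the group's products reviewed by >= |U_g| * lam_d members."""
    return sum(
        sum(A[i][j] for i in users) >= len(users) * lam_d for j in products
    ) / len(products)

def group_sparsity(B, users, products, lam_sp):
    """Share of the group's products reviewed by <= |U_g| * lam_sp members."""
    return sum(
        sum(B[i][j] for i in users) <= len(users) * lam_sp for j in products
    ) / len(products)

# Two users, two products: both reviewed p0, only one reviewed p1.
A = {'u1': {'p0': 1, 'p1': 0}, 'u2': {'p0': 1, 'p1': 1}}
assert group_density(A, ['u1', 'u2'], ['p0', 'p1'], lam_d=1.0) == 0.5
```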
  • 50. Appendix G Calculation Formulas
Group time window:
𝑡𝑤(𝑔) = max_𝑞 𝑔𝑡𝑤(𝑔, 𝑞)
𝑔𝑡𝑤(𝑔, 𝑞) = 0 if span(𝑔, 𝑞) > θ; 1 − span(𝑔, 𝑞)/θ otherwise
span(𝑔, 𝑞) = max date(𝑟) − min date(𝑟), over all reviews 𝑟(𝑢, 𝑞) with 𝑢 ∈ 𝑈𝑔
Group co-reviewing ratio:
𝑐𝑟(𝑔) = (1 / (|𝑈𝑔| choose 2)) Σ_{𝑖=1}^{|𝑈𝑔|} Σ_{𝑗=𝑖+1}^{|𝑈𝑔|} 𝟙{(𝑢𝑖, 𝑢𝑗) ∈ 𝐸}
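A sketch of these two indicators, under assumed representations: `review_dates` maps each target product to the day numbers of the group's reviews on it, and `edges` holds the co-reviewing graph as a set of sorted user pairs.

```python
from itertools import combinations

def time_window(review_dates, theta):
    """tw(g): close to 1 for same-day bursts per target, 0 beyond theta days."""
    def gtw(dates):
        span = max(dates) - min(dates)
        return 0.0 if span > theta else 1 - span / theta
    return max(gtw(dates) for dates in review_dates.values())

def co_review_ratio(users, edges):
    """cr(g): fraction of member pairs adjacent in the co-reviewing graph."""
    pairs = list(combinations(sorted(users), 2))
    return sum(pair in edges for pair in pairs) / len(pairs)

# A 2-day spread on one target scores near 1; a 39-day spread scores 0.
assert time_window({'q': [3, 4, 5]}, theta=30) == 1 - 2 / 30
```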
  • 51. Appendix G Calculation Formulas
Spammer score:
spamicity(𝑢) = (intervals_ratio(𝑢) + 1) / 2 if 𝑢 ∈ 𝑔 for some reported group 𝑔; intervals_ratio(𝑢) otherwise
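The per-user score is a one-liner. A minimal sketch (the function name is illustrative): members of a reported collusion group have their intervals ratio shifted halfway toward 1, while users outside any reported group keep it unchanged.

```python
def spammer_score(intervals_ratio, in_reported_group):
    """Final user spamicity: group membership boosts the intervals ratio."""
    if in_reported_group:
        return (intervals_ratio + 1) / 2
    return intervals_ratio

assert spammer_score(0.5, True) == 0.75    # grouped user gets a boost
assert spammer_score(0.5, False) == 0.5    # ungrouped user is unchanged
```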
