The document proposes a new method for detecting spam reviews by analyzing time intervals. It computes a "spamicity" score for each time interval from the characteristics of the interval itself, its relationships to other intervals, and the probability of its content. Intervals whose score exceeds a threshold are flagged as spammy. Users and reviews from spammy intervals are then grouped and ranked to identify spammers and their targeted products and reviews. The method improves on prior work, achieving higher precision and recall in detecting spam users, products, and reviews.
6. What Spam Spam...Spam?
FH: Best game ever! I love the pictures and the quality! RECOMMENDED!!
JF: I got it as a gift and I loooooove it <3
JV: I have never enjoyed a game like this one!
SS: This game is super with a super quality!
7. What Spam Spam...Spam?
More than 20% of Yelp's reviews contain misleading content, and the share is growing steadily; one-third of all consumer reviews on the Internet are estimated to be misleading [Rayana and Akoglu 2015].
Spammers are becoming smarter in hiding themselves.
8. Has anyone noticed the Spam Spam...Spam?
Fake Reviews and Likes
• Liu et al. SPEC and SVM classification (EMNLP-CoNLL, 2007).
Suspicious Users
• Jiang et al. CatchSync (KDD, 2014).
Collusion Groups
• Cao et al. SynchroTrap (CCS, 2014).
• Beutel et al. CopyCatch (WWW, 2013).
• Xu et al. KNN and transaction history (CIKM, 2013).
9. Another way to deal with Spam Spam...Spam?
6 Jan 2020
8-9 Jan 2020
15-17 Dec 2019
10. Another way to deal with Spam Spam...Spam?
6 Jan 2020
8-9 Jan 2020
15-17 Jan 2020
FH: Best game ever! I love the pictures and the quality! RECOMMENDED!!
JF: I got it as a gift and I loooooove it <3
JV: I have never enjoyed a game like this one!
SS: This game is super with a super quality!
11. Spammy Spammy...Spammy… Time Intervals
Not done before!
Doesn't depend on assumptions that can be easily broken.
Might help in catching smart spammers!
Might help in catching one-time spamming campaigns!
Further results (spamming groups, users, products, and reviews) can be reported.
12. Spammy Spammy...Spammy… Time Intervals
t is a time interval.
If spamicity_t(t) ≥ μ_t, then t is reported as a spammy time interval.
spamicity_t(t) = (1/3) [ψ(t) + ψ_pairs(t) + ψ_prob(t)]
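The interval test above can be sketched in a few lines. This is a minimal illustration, assuming the three component scores each lie in [0, 1]; the function names and the example threshold value are illustrative, not part of the original method.

```python
def interval_spamicity(psi: float, psi_pairs: float, psi_prob: float) -> float:
    """spamicity_t(t) = (1/3) [psi(t) + psi_pairs(t) + psi_prob(t)]."""
    return (psi + psi_pairs + psi_prob) / 3.0

def is_spammy(psi: float, psi_pairs: float, psi_prob: float, mu_t: float = 0.5) -> bool:
    # An interval is reported as spammy when its score reaches the threshold mu_t
    # (0.5 here is only a placeholder value).
    return interval_spamicity(psi, psi_pairs, psi_prob) >= mu_t
```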
13. Spammy Spammy...Spammy… Time Intervals
spamicity_t(t) = (1/3) [ψ(t) + ψ_pairs(t) + ψ_prob(t)]
ψ(t) is the weight of the time interval.
It represents the characteristics of the interval itself.
It is defined by three characteristics:
Density.
Users Ratio.
Time Weight.
16. Spammy Spammy...Spammy… Time Intervals
spamicity_t(t) = (1/3) [ψ(t) + ψ_pairs(t) + ψ_prob(t)]
ψ_pairs(t) is the pairs score of the time interval.
It represents the effect of what's happening in other intervals.
It is defined as the normalized sum, over all other intervals t′, of:
score(t, t′) = (|u ∩ u′| · ψ(t′)) / |u ∪ u′|
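A sketch of the pair score, assuming u and u′ are the user sets of the two intervals. Normalizing ψ_pairs by the number of other intervals is an assumption made here for illustration; the deck only says "normalized sum".

```python
def pair_score(users_t: set, users_t2: set, psi_t2: float) -> float:
    # score(t, t') = |u ∩ u'| · psi(t') / |u ∪ u'|: user overlap between the
    # two intervals, scaled by the other interval's weight.
    union = users_t | users_t2
    if not union:
        return 0.0
    return len(users_t & users_t2) * psi_t2 / len(union)

def psi_pairs(users_t: set, others: list) -> float:
    # others: (user_set, psi) for every other interval. Dividing by the
    # count of other intervals is an assumed normalization.
    if not others:
        return 0.0
    return sum(pair_score(users_t, u2, p2) for u2, p2 in others) / len(others)
```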
17. Spammy Spammy...Spammy… Time Intervals
spamicity_t(t) = (1/3) [ψ(t) + ψ_pairs(t) + ψ_prob(t)]
ψ_prob(t) is the weighted probability of the interval content.
prob(t|p): the probability of the interval content under the distribution of the product's ratings.
The lower the probability, the more spammy the interval is.
It is defined as follows:
ψ_prob(t) = 1 − prob(t|p)
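A hedged sketch of the content-probability score. The deck does not spell out how prob(t|p) is computed; here it is approximated, purely for illustration, as the average probability of the interval's ratings under the product's overall rating distribution.

```python
from collections import Counter

def psi_prob(interval_ratings: list, product_ratings: list) -> float:
    # psi_prob(t) = 1 - prob(t|p). The estimate of prob(t|p) below
    # (mean per-rating probability) is an assumption, not the paper's definition.
    dist = Counter(product_ratings)
    total = len(product_ratings)
    prob = sum(dist[r] / total for r in interval_ratings) / len(interval_ratings)
    return 1.0 - prob
```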
18. Spammy Spammy...Spammy… Time Intervals
If spamicity_t(t) ≥ μ_t, then t is reported as a spammy time interval.
spamicity_t(t) = (1/3) [ψ(t) + ψ_pairs(t) + ψ_prob(t)]
Do we need anything else to get the best possible results?
23. Spammy Spammy...Spammy… Groups
g is a group.
If spamicity_g(g) ≥ μ_g, then g is reported as a spamming group.
spamicity_g(g) = (1/6) [φ_D(g) + (1 − φ_S(g)) + φ_P(g) + φ_S(g) + φ_TW(g) + φ_CD(g)]
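The group score can be sketched as the average of the six characteristic scores. This assumes each φ lies in [0, 1] and that the (1 − φ) term is the sparsity characteristic, as the next slide's labels suggest; both are assumptions made for illustration.

```python
def group_spamicity(phi_density: float, phi_sparsity: float, phi_products: float,
                    phi_size: float, phi_time_window: float, phi_coreview: float) -> float:
    # spamicity_g(g): mean of six characteristic scores. Sparsity is
    # inverted (1 - phi), matching the one subtracted term in the formula.
    return (phi_density + (1.0 - phi_sparsity) + phi_products
            + phi_size + phi_time_window + phi_coreview) / 6.0
```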
24. Spammy Spammy...Spammy… Groups
spamicity_g(g) = (1/6) [φ_D(g) + (1 − φ_S(g)) + φ_P(g) + φ_S(g) + φ_TW(g) + φ_CD(g)]
The six terms correspond to six group characteristics:
Minimum Density.
Maximum Sparsity.
Products Count.
Size.
Time Window.
Co-reviewing Ratio.
25. Spammy Spammy...Spammy… Groups
Take the users of each reported interval?
Consider this set of users as a spamming group?
Just like that?
Oh… we can rank them using the group spam score!
That's it?
NO!
27. Spammy Spammy...Spammy… Groups
Initial groups are cliques in the user-user graph!
We use the initial groups as building blocks that can be merged to create collusion spamming groups.
Merge the pair with the highest common-users ratio.
Backtrack in case the result has a low score.
Repeat until no more merges are possible.
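The merge loop can be sketched as below. This is an assumed simplification: `score` stands in for the group spamicity machinery, `min_score` for its threshold, and "backtrack" is modeled as rejecting a merge whose result scores too low.

```python
def merge_groups(groups: list, score, min_score: float) -> list:
    """Greedily merge the pair of groups with the highest common-users ratio."""
    groups = [set(g) for g in groups]
    while True:
        best, best_ratio = None, 0.0
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                union = groups[i] | groups[j]
                ratio = len(groups[i] & groups[j]) / len(union)
                if ratio > best_ratio:
                    best, best_ratio = (i, j), ratio
        if best is None:
            break  # no more possible merges (no overlapping pairs left)
        i, j = best
        merged = groups[i] | groups[j]
        if score(merged) < min_score:
            break  # backtrack: the merged result scores too low, keep the parts
        del groups[j]
        groups[i] = merged
    return groups
```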
31. Spammy Spammy...Spammy… Users
Report users of the top-ranked intervals.
Reported users are ranked based on a spamicity score of a user.
spamicity_u(u) = (intervalsRatio(u) + 1) / 2, if u is a member of at least one group;
spamicity_u(u) = intervalsRatio(u), otherwise.
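The piecewise user score above, as a minimal sketch. It assumes intervalsRatio(u) is a value in [0, 1] (e.g. the fraction of reported intervals containing u); that interpretation is not spelled out in the deck.

```python
def user_spamicity(intervals_ratio: float, in_group: bool) -> float:
    # Membership in at least one reported collusion group lifts the
    # score halfway toward 1; otherwise the intervals ratio is used as-is.
    if in_group:
        return (intervals_ratio + 1.0) / 2.0
    return intervals_ratio
```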
32. Spammed Spammed...Spammed… Products
Report the products of the top-ranked intervals.
Also report products that were co-reviewed by all members of a reported collusion group.
Category | Before | After
Recall | 0.625 | 0.813
F1-score | 0.769 | 0.897
Reported products results after adding the additional targets.
33. Spammy Spammy...Spammy… Reviews
Report the reviews of the top-ranked intervals.
Also report reviews written by all members of a reported collusion group for a product.
Category | Before | After
Recall | 0.387 | 0.532
F1-score | 0.558 | 0.695
Reported reviews results after adding the additional reviews.
34. Conclusion
Detecting suspicious time intervals is brand new and very helpful in detecting spamming campaigns.
The spamicity of an interval is based on:
The interval's characteristics (weight).
The effect of other time intervals (pairs score).
The weighted probability of the interval's content.
When having a set of suspicious time intervals, we can:
Create collusion spamming groups and score them.
Report individual users, ranked by a spamicity estimation.
Report targeted products.
Report spammy reviews.
35. What's next?
Check the results on real Amazon data files.
Compare the solution with other methods (already found some!).
Find a cool name for the algorithm!
Finish “not before” the deadline!
36. Thank you!
37. References
Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. Low-quality
product review detection in opinion summarization. Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP-CoNLL), pages 334–342, 2007.
Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and Shiqiang Yang.
Catchsync: catching synchronized behavior in large directed graphs. KDD ’14
Proceedings of the 20th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 941–950, 2014.
Bimal Viswanath, M. Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. Proceedings of the 23rd USENIX Security Symposium (USENIX Security), pages 223–238, 2014.
Qiang Cao, Xiaowei Yang, Jieqi Yu, and Christopher Palow. Uncovering large groups of active malicious accounts in online social networks. CCS '14 Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 477–488, 2014.
38. References
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, and Christos
Faloutsos. Copycatch: stopping group attacks by spotting lockstep behavior in
social networks. WWW ’13 Proceedings of the 22nd international conference on
World Wide Web , pages 119–130, 2013.
Zhen Xie and Sencun Zhu. Grouptie: toward hidden collusion group discovery in
app stores. WiSec ’14 Proceedings of the 2014 ACM conference on Security and
privacy in wireless and mobile networks, pages 153–164, 2014.
Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. Uncovering collusive spammers in Chinese review websites. CIKM '13 Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 979–988, 2013.
Shebuti Rayana and Leman Akoglu. Collective opinion spam detection: Bridging review networks and metadata. KDD '15 Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 985–994, 2015.