Slides of my Master's thesis defense: SPRAP: Detecting Opinion Spam Campaigns in Online Rating Services - Exploratory Data Analysis Group - Universität des Saarlandes
2. Just a cool title? Or is something actually going wrong?
Amazon.de reviews:
"I got it as a gift and I loooooove it <3"
"This game is super with a high quality!"
"I have never enjoyed a game like this one!"
"Best game ever! I love the pictures and the quality!"
"RECOMMENDED!!"
3. Not just a cool title! Something’s INDEED going wrong!
More than 20% of Yelp’s reviews are of misleading content, and one-third of all consumer reviews on the Internet are estimated to be misleading [Rayana and Akoglu, 2015].
Spammers are becoming smarter at hiding themselves:
• A deceptive mix of legitimate reviews to build trust and fake reviews to achieve the tasks.
• Avoiding the well-known spam patterns.
4. Has anyone noticed that?
Fake Reviews and Likes
• Liu et al., SPEC and SVM classification (EMNLP-CoNLL, 2007)
Suspicious Users
• Rayana and Akoglu, SPEAGLE (KDD, 2015)
Collusion Groups
• Dhawan et al., DeFrauder (IJCAI, 2019)
5. Another way to deal with that? Maybe more robust?
Characteristics that cannot be avoided:
• Relatively short period
• Using the same account
• Co-reviewing
[Figure: log of the number of pairs that co-reviewed $n$ products, plotted against the number of co-reviewed products.]
6. Another way to deal with that? Maybe more robust?
[Figure: review timeline with highlighted intervals: 6 Jan 2020, 8-9 Jan 2020, 15-17 Dec 2019.]
7. Another way to deal with that? Maybe more robust?
[Figure: review timeline with highlighted intervals: 6 Jan 2020, 8-9 Jan 2020, 15-17 Dec 2020.]
• Detecting spam time intervals in which spam campaigns take place
• Detecting collusion spam groups who perform those spam campaigns
8. How to do it?
Spam behavior is rare and the majority of behavior is genuine.
Anomaly detection probabilistic model:
$\exists\, p_r : x \text{ is spam} \Rightarrow p_r(x) < \text{some threshold}$
Detecting spam time intervals in which spam campaigns take place:
$\exists\, p_T : t \text{ corresponds to a spam campaign} \Rightarrow p_T(t) < \mu$
Detecting collusion spam groups who perform those spam campaigns:
$\exists\, p_G : g \text{ is a collusion spam group} \Rightarrow p_G(g) < \delta$
9. How to do it?
[Diagram: $p_T$ and $p_G$ are approximated through spamicity indicators, which are combined into spamicity scores.]
10. Intervals Spamicity Score
Spamicity indicators: Members Count; Harmonious Rates; Quick Attacks; Big Deviation from the Target’s True Quality; Multiple Targets.
Interval characteristics: Size $s(t)$; Density $d(t)$; Weighted Width $w(t)$; Probability $f(t)$; Pairs Score $\psi_{\mathit{pairs}}(t)$.
Interval weight: $\psi(t)$.
The characteristics are averaged into one spamicity score, $\mathit{spamicity}(t)$.
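The averaging step can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes every characteristic is already normalized to [0, 1] and uses a plain unweighted mean. The slides do not say how the interval weight enters the score, so it is left out. The function name, the dictionary keys, and the example values are hypothetical; the threshold 0.4 is the value of μ quoted on the evaluation slides.

```python
from statistics import mean


def interval_spamicity(characteristics):
    """Average characteristic values (assumed normalized to [0, 1]) into one score."""
    values = list(characteristics.values())
    if not values:
        raise ValueError("at least one characteristic is required")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("characteristics are assumed to lie in [0, 1]")
    return mean(values)


# Hypothetical characteristic values for one interval t.
example = {
    "size": 0.8,
    "density": 0.9,
    "weighted_width": 0.7,
    "probability": 0.6,
    "pairs_score": 0.5,
}
score = interval_spamicity(example)

MU = 0.4  # interval threshold quoted on the evaluation slides
is_spam_interval = score >= MU
print(round(score, 3), is_spam_interval)
```

An unweighted mean treats all indicators as equally important; weighting them differently is listed in the slides as possible future work.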
11. Groups Spamicity Score
Spamicity indicators: Targeted Products; Members Count; # Reviewed Products NOT Common Between Members; Quick Attacks; Co-reviewing Targets; # Reviewed Products Common Between Members.
Group characteristics: Targets Count $f_g(g)$; Size $s(g)$; Sparsity $\mathit{sp}(g)$; Time Window $\mathit{tw}(g)$; Co-reviewing Ratio $\mathit{cr}(g)$; Density $d(g)$.
The characteristics are averaged into one spamicity score, $\mathit{spamicity}(g)$.
15. SPRAP – Top Ranked Intervals
Extracting intervals for each product $q$:
• Sliding window approach: $\mathit{width} \in [1, |\mathit{timeline}_q|]$
• This is a huge and redundant space, so we restrict it to $\mathit{width} \in [1, \tau]$
Intervals with a high spamicity score are reported:
$\mathit{spamicity}(t) \geq \mu \Rightarrow t \text{ is reported as a spam interval}$
What is $P$?
$P$ is the empirical distribution of intervals:
• Contains all valid intervals.
• Contains intervals before further filtering.
Added intervals are merged to get wider entities.
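The bounded sliding window can be sketched like this. It is only an illustration under stated assumptions: widths are counted in whole days, a product's timeline runs from its first to its last review date, and a window is kept only if it contains at least one review. The function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def candidate_intervals(review_dates, tau):
    """List (start, end) day windows of width 1..tau that contain a review."""
    if tau < 1:
        raise ValueError("tau must be at least 1")
    days = set(review_dates)
    if not days:
        return []
    first, last = min(days), max(days)
    intervals = []
    start = first
    while start <= last:
        for width in range(1, tau + 1):
            end = start + timedelta(days=width - 1)
            if end > last:
                break  # the window would run past the end of the timeline
            if any(start <= d <= end for d in days):
                intervals.append((start, end))
        start += timedelta(days=1)
    return intervals


# Hypothetical review dates of one product; tau = 3 as on the evaluation slides.
dates = [date(2019, 9, 1), date(2019, 9, 2), date(2019, 9, 4)]
windows = candidate_intervals(dates, tau=3)
for start, end in windows:
    print(start.isoformat(), end.isoformat())
```

Capping the width at τ keeps the number of candidates linear in the timeline length instead of quadratic; wider campaigns are then recovered by merging reported intervals.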
16. SPRAP – Collusion Spam Groups
Creating all possible groups is infeasible.
We are not only after cliques in the user co-reviewing graph, so we cannot use Maximum Cliques or MFIM.
We are only considering “valid groups”:
[Figure: example user co-reviewing graph with nodes $u_1$ to $u_7$.]
17. SPRAP – Collusion Spam Groups
Top Ranked Intervals → Initial Groups → Refined Groups → Collusion Spamming Groups
• Initial Groups: groups taken directly from the Top Ranked Intervals.
• Refined Groups: groups after removing non-spammers.
• Collusion Spamming Groups: final reported groups after merging the refined groups (not necessarily cliques).
18. SPRAP – Collusion Spam Groups
[Figure: review timeline with highlighted intervals: 6 Jan 2020, 8-9 Jan 2020, 15-17 Dec 2020.]
19. SPRAP – Collusion Spam Groups
What is $P$?
$P$ is the empirical distribution of valid groups, but:
• The set of created groups is very small.
• The majority of created groups are connected to spam campaigns.
• Creating all valid groups is infeasible, so we use sampling!
• Straightforward sampling can lead to a lot of rejections, so we use MCMC!
A group is considered spam if $\mathit{spamicity}(g) \geq \delta$.
20. SPRAP – Collusion Spam Groups
Normalization
Schaeffer [2010] dealt with a balanced random walk that:
• reaches a uniform stationary distribution.
• works on undirected, unweighted graphs.
$$p_{v,w} = \begin{cases} \min\left(\dfrac{1}{\deg(v)}, \dfrac{1}{\deg(w)}\right) & \text{if } w \in \mathit{neighbors}(v) \\ 1 - \sum_{w' \in \mathit{neighbors}(v)} \min\left(\dfrac{1}{\deg(v)}, \dfrac{1}{\deg(w')}\right) & \text{if } w = v \\ 0 & \text{otherwise} \end{cases}$$
However, our graph is the user co-reviewing graph and we want to sample valid groups!
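The balanced random walk on a plain undirected, unweighted graph can be sketched as below. This illustrates only the transition rule quoted from Schaeffer [2010], not the valid-groups chain built on top of it. The function name and the toy graph are hypothetical. Because min(1/deg v, 1/deg w) is symmetric in v and w, the transition matrix is symmetric and doubly stochastic, which is why the uniform distribution is stationary.

```python
def balanced_walk_matrix(adjacency):
    """Transition probabilities of the balanced random walk.

    adjacency maps each node to the set of its neighbours
    (undirected, unweighted graph without self-loops).
    """
    nodes = sorted(adjacency)
    degree = {v: len(adjacency[v]) for v in nodes}
    matrix = {v: {w: 0.0 for w in nodes} for v in nodes}
    for v in nodes:
        for w in adjacency[v]:
            matrix[v][w] = min(1.0 / degree[v], 1.0 / degree[w])
        # The remaining probability mass stays at v as a self-loop.
        matrix[v][v] = 1.0 - sum(matrix[v][w] for w in adjacency[v])
    return matrix


# Hypothetical toy graph: a is linked to b, c, d; b and c are also linked.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
P = balanced_walk_matrix(graph)

# One step of the chain starting from the uniform distribution.
n = len(graph)
next_dist = {w: sum(P[v][w] / n for v in graph) for w in graph}
print(next_dist)
```

The low-degree node d keeps most of its mass in a self-loop, which is what compensates for the bias a simple random walk has towards high-degree nodes.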
21. SPRAP – Collusion Spam Groups
Normalization
We define a Valid Groups Markov Chain:
• States are valid groups.
• We use the defined balanced random walk to sample valid groups.
• No need to build the whole chain before sampling.
• We add a random jump with a small probability $\epsilon$.
22. SPRAP – Evaluation
Thresholds and Configurations
We estimate the best values of the spamicity thresholds ($\mu$, $\delta$) by 5 repetitions of leave-one-out cross-validation (LOOCV).
We set the parameters as follows: $\mu = 0.4$, $\delta = 0.6$, $\tau = 3$.
23. SPRAP – Evaluation
General Performance (R = recall, P = precision)
Data Set  Intervals      Reviews        Targets        Spammers       Grouped Spammers
          R      P       R      P       R      P       R      P       R      P
A         1      1       1      1       1      1       1      1       1      1
B         1      1       1      1       1      1       1      1       1      1
C         0.926  0.978   0.925  0.985   0.889  1       0.755  0.952   0.755  0.976
D         0.991  0.946   1      0.962   1      0.92    1      0.914   1      0.955
E         1      0.95    0.997  0.974   1      0.963   1      0.922   0.972  0.972
F         0.986  0.939   0.994  0.969   1      0.895   0.989  0.869   0.989  0.989
G         1      0.965   1      0.979   1      0.964   1      0.89    1      0.946
H         1      1       1      1       1      1       1      1       0.938  1
27. SPRAP – Evaluation
Wide Dense Campaigns – Effects of $\tau$
Generated interval: 01-09-2019 to 08-09-2019
Interval in $I$: 01-09-2019 to 08-09-2019
Intervals in $T$:
• 04-09-2019 to 04-09-2019
• 06-09-2019 to 06-09-2019
• 04-09-2019 to 06-09-2019
• 03-09-2019 to 05-09-2019
• 05-09-2019 to 06-09-2019
• 03-09-2019 to 04-09-2019
• 06-09-2019 to 08-09-2019
• 04-09-2019 to 05-09-2019
• 01-09-2019 to 03-09-2019
Details of detecting a time interval of width 8 in data set H.
28. SPRAP – Evaluation
Comparison to SPEAGLE [Rayana and Akoglu, 2015]
SPEAGLE reports spammers, fake reviews, and targets.
SPEAGLE depends heavily on textual characteristics, so we plant their labeled reviews in data set C, whose spammers are pure spammers.
Algorithm  Reviews        Spammers       Targets
           R      P       R      P       R      P
SPRAP      0.925  0.985   0.755  0.952   0.889  1
SPEAGLE    1      0.196   1      0.118   1      0.07
Results of SPRAP with $\mu = 0.4$, $\delta = 0.6$, $\tau = 3$ against the best achieved recall and precision values for SPEAGLE.
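To weigh recall against precision in a single number, one option is the F1 score, the harmonic mean of the two. The slides report only recall and precision, so F1 here is an added convenience computed from the table values above, not a result from the thesis. The helper name and data layout are hypothetical.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (0 when both are 0)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the comparison table for data set C.
reported = {
    "SPRAP": {"reviews": (0.925, 0.985), "spammers": (0.755, 0.952), "targets": (0.889, 1.0)},
    "SPEAGLE": {"reviews": (1.0, 0.196), "spammers": (1.0, 0.118), "targets": (1.0, 0.07)},
}

f1 = {
    algorithm: {entity: f1_score(r, p) for entity, (r, p) in rows.items()}
    for algorithm, rows in reported.items()
}
for algorithm, scores in f1.items():
    print(algorithm, {entity: round(value, 3) for entity, value in scores.items()})
```

SPEAGLE's perfect recall comes with very low precision, which the harmonic mean penalizes heavily.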
29. SPRAP – Evaluation
Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019]
DeFrauder detects collusion spam groups.
We compare the two methods on data set D, which has 6 planted collusion spam groups of a mixed nature.
Algorithm  $|C|$  $s_{max}(g)$  Spammers       Targets
                                R      P       R      P
SPRAP      9      26            1      0.955   1      0.92
DeFrauder  126    5             0.709  0.329   1      0.383
Results of SPRAP with $\mu = 0.4$, $\delta = 0.6$, $\tau = 3$ against DeFrauder.
30. SPRAP – Evaluation
Merging Groups and Comparison to DeFrauder [Dhawan et al., 2019]
Group  All targets reviewed  Reported    Original in     Reported  Reported  FP
       by all members        as 1 group  refined groups  members   targets   members
$g_1$  Yes                   Yes         7               3/3       7/7       0
$g_2$  No                    Yes         9               4/4       9/9       0
$g_3$  No                    No, as 2    3               5/5       3/3       1
$g_4$  No                    No, as 3    6               12/12     5/5       0
$g_5$  No                    Yes         15              15/15     10/10     1
$g_6$  No                    Yes         8               25/25     5/5       1
Reported collusion groups of SPRAP for data set D.
31. SPRAP – Evaluation
Amazon Software data
Amazon Software data set:
• Unlabeled.
• Has 341931 reviews, 275374 users, and 28736 products.
Reported entities:
• Spam intervals: $|I| = 9606$
• Spam groups: $|C| = 3797$ (35.5% non-cliques; 33374 members)
• Spammers: $|S| = 37883$ (37883 with score ≥ 0.5)
• Fake reviews: $|Y| = 48043$
• Targets: $|Z| = 1066$
Further notes:
• The longest reported time interval spans 71 days.
• The biggest reported collusion spam group has 1139 members.
32. Conclusion
What I did
Detecting spam campaigns is not trivial due to:
• Lack of ground truth.
• Huge overlap between spam and genuine behavior.
• Evolution of spammers, who keep altering their techniques.
Spamicity scores that depend on a set of indicators can be a good approximation of the optimal distribution to detect different spam entities.
We presented SPRAP, which:
• Detects different spam entities with very good accuracy.
• Starts from locating spam time intervals.
• Avoids easily broken assumptions.
33. Conclusion
What could be done
• Turning the solution into a full probabilistic anomaly detection model.
• Weighting the spamicity indicators differently to favor some over the others (e.g., favoring groups with more targets).
• Importance sampling of groups to include more “close-to-spam” groups.
34. Thank you!
Special thanks to Prof. Vreeken, who gave me the opportunity to be a part of the amazing EDA group and supported me all along the way, and to Janis for his valuable assistance and help throughout the whole process.
I guess I have a Master’s degree now :D
35. References
Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. Low-quality product review detection in opinion summarization. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 334–342, 2007.
Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and Shiqiang Yang. Catchsync: catching synchronized behavior in large directed graphs. KDD ’14 Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 941–950, 2014.
Bimal Viswanath, M. Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. Proceedings of the 23rd USENIX Security Symposium (USENIX Security), pages 223–238, 2014.
Qiang Cao, Xiaowei Yang, Jieqi Yu, and Christopher Palow. Uncovering large groups of active malicious accounts in online social networks. CCS ’14 Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 477–488, 2014.
Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, and Christos Faloutsos. Copycatch: stopping group attacks by spotting lockstep behavior in social networks. WWW ’13 Proceedings of the 22nd international conference on World Wide Web, pages 119–130, 2013.
36. References
Zhen Xie and Sencun Zhu. Grouptie: toward hidden collusion group discovery in app stores. WiSec ’14 Proceedings of the 2014 ACM conference on Security and privacy in wireless and mobile networks, pages 153–164, 2014.
Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. Uncovering collusive spammers in chinese review websites. CIKM ’13 Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 979–988, 2013.
Shebuti Rayana and Leman Akoglu. Collective opinion spam detection: Bridging review networks and metadata. KDD ’15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 985–994, 2015.
Sarthika Dhawan, Siva Charan Reddy Gangireddy, Shiv Kumar, and Tanmoy Chakraborty. Spotting collective behaviour of online frauds in customer reviews. IJCAI-19, pages 245–251, 2019.
Satu Schaeffer. Scalable uniform graph sampling by local computation. SIAM J. Scientific Computing, 32:2937–2963, 01 2010. doi: 10.1137/080716086.
38. Appendix B
Refining Groups
A group is considered spam if $\mathit{spamicity}(g) \geq \delta$.
Refining groups is done by removing the least-spammy user in each iteration as long as the spamicity is increasing.
The least-spammy user is estimated based on:
$$\mathit{intervals\_ratio}(u) = \frac{\sum_{t \in I} \mathbb{1}\{u \in U_t\}}{\sum_{t \in T} \mathbb{1}\{u \in U_t\}}$$
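The refinement loop can be sketched as a greedy procedure. This is an illustration under assumptions, not the thesis code: each interval is represented only by its user set, the reported intervals are taken to be a subset of all candidate intervals, and the `spamicity` callable is a toy stand-in for the real group score. All names and data are hypothetical.

```python
def intervals_ratio(user, reported_intervals, all_intervals):
    """Fraction of the intervals containing `user` that were reported as spam.

    Each interval is given as the set of users who reviewed in it.
    """
    in_all = sum(1 for users in all_intervals if user in users)
    if in_all == 0:
        return 0.0
    in_reported = sum(1 for users in reported_intervals if user in users)
    return in_reported / in_all


def refine_group(group, spamicity, ratio):
    """Drop the least-spammy user while doing so increases the group spamicity."""
    group = set(group)
    best = spamicity(group)
    while len(group) > 1:
        least_spammy = min(sorted(group), key=ratio)
        candidate = group - {least_spammy}
        score = spamicity(candidate)
        if score <= best:
            break  # removing another user no longer helps
        group, best = candidate, score
    return group, best


# Hypothetical data: user sets of all candidate intervals and of the reported ones.
all_intervals = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
reported_intervals = [{"a", "b", "c"}, {"a", "b"}]


def toy_spamicity(members):
    """Toy stand-in score: share of members belonging to a known core."""
    return len(members & {"a", "b"}) / len(members)


refined, refined_score = refine_group(
    {"a", "b", "c", "d"},
    toy_spamicity,
    lambda u: intervals_ratio(u, reported_intervals, all_intervals),
)
print(sorted(refined), refined_score)
```

Stopping at the first removal that fails to raise the score keeps the loop at most linear in the group size, at the price of possibly missing a better subgroup that would need a temporary drop.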
39. Appendix B
Reporting Groups
Merging refined groups is done iteratively as long as the spamicity of the resulting group is preserved.
In each iteration we merge the pair with the highest common-users ratio.
Reported collusion spam groups are not necessarily cliques in the user co-reviewing graph, unlike the initial and the refined ones.
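A possible reading of the merging loop is sketched below. Two points are assumptions, since the slides do not define them: the common-users ratio is taken to be the Jaccard overlap of two groups, and "spamicity is preserved" is taken to mean that the merged group still scores at least δ. The `spamicity` callable is again a toy stand-in, and all names are hypothetical.

```python
from itertools import combinations


def common_users_ratio(a, b):
    """Jaccard overlap of two groups (an assumed definition of the ratio)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_groups(groups, spamicity, delta):
    """Repeatedly merge the most overlapping pair while the union stays spam."""
    groups = [frozenset(g) for g in groups]
    while len(groups) >= 2:
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda pair: common_users_ratio(groups[pair[0]], groups[pair[1]]),
        )
        union = groups[i] | groups[j]
        if common_users_ratio(groups[i], groups[j]) == 0 or spamicity(union) < delta:
            break  # no overlapping pair left, or the merge would dilute the group
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [union]
    return groups


def toy_spamicity(members):
    """Toy stand-in score: share of members in a hypothetical suspicious set."""
    return len(members & {"a", "b", "c", "d", "x"}) / len(members)


refined_groups = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}]
reported_groups = merge_groups(refined_groups, toy_spamicity, delta=0.6)
print([sorted(g) for g in reported_groups])
```

Because a merged group is a union of overlapping groups, it need not be a clique in the co-reviewing graph, which matches the remark above about reported groups.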
46. Appendix F
Amazon Software data
The highest-ranked interval $t_{max}$:
• Spamicity score = 0.987.
• Up-voting campaign with 17 high ratings over 2 days.
• Low probability, since the target has a lot of reviews with ratings in {1, 2, 3}.
The highest-ranked collusion group $g_{max}$:
• Spamicity score = 0.89.
• 16 users giving 5-star reviews to one target $q$ within 2 days.
• The majority of members only reviewed $q$.
• The corresponding initial group has 27 members.