Presentation of the KDD 2015 paper describing the RSC model:
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos
The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
1. RSC: Mining and Modeling Temporal Activity in Social Media
Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos
Universidade de São Paulo
KDD 2015 – Sydney, Australia
*alceufc@icmc.usp.br
2. Introduction
Users generate sequences of time-stamps when they use a social media Web site
What can we learn from time-stamps?
Are there common patterns?
Can we tell if a user is a bot or a human?
[Figure: sequence of tweet time-stamps; each bar is a tweet time-stamp]
3. Outline
Pattern Mining
What patterns can we discover from temporal
activities of social media users?
Modeling
Bot Detection
Experiments
Conclusion
4. Pattern Mining: Datasets
Reddit Dataset
Time-stamps from comments
21,198 users
20 million time-stamps
Twitter Dataset
Time-stamps from tweets
6,790 users
16 million time-stamps
For each user we have:
Sequence of posting time-stamps: T = (t1, t2, t3, …)
Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)
[Diagram: time axis with postings t1, t2, t3, t4; ∆1, ∆2, ∆3 are the gaps between consecutive postings]
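The IAT sequence defined above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `inter_arrival_times` and the example values are assumptions.

```python
def inter_arrival_times(timestamps):
    """Return the IAT sequence (∆1, ∆2, ...) for a list of posting time-stamps."""
    ordered = sorted(timestamps)
    # Each IAT is the gap between one posting and the next one.
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical time-stamps, in seconds since the first posting.
example = [0, 60, 90, 7290]
print(inter_arrival_times(example))  # [60, 30, 7200]
```

A sequence of n time-stamps yields n - 1 inter-arrival times.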
5. Pattern Mining
Pattern 1: Distribution of IAT is heavy-tailed
Users can be inactive for long periods of time before making
new postings
[Figure: IAT Complementary Cumulative Distribution Function (CCDF) on log-log axes; panels: Reddit Users, Twitter Users]
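For readers who want to reproduce a CCDF like the one on this slide, the empirical version can be sketched as follows. The function name `empirical_ccdf` is an assumption; the slide does not say how the curves were computed.

```python
def empirical_ccdf(iats):
    """Return sorted (x, P(IAT > x)) pairs for the distinct IAT values."""
    ordered = sorted(iats)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence,
        # so that tied values are counted together.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - (i + 1)) / n))
    return points
```

On log-log axes the last point (probability 0) cannot be drawn and is usually dropped. A heavy tail shows up as a curve that decays slowly instead of dropping off sharply.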
6. Pattern Mining
Pattern 2: Bimodal IAT distribution
Users have highly active sections and resting periods
[Figure: log-binned histogram of posting IAT for Twitter users; x-axis: ∆, IAT (seconds), 10^2 to 10^6; y-axis: PDF; 1st mode at about 1 min, 2nd mode at about 3 h]
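A log-binned histogram uses bins whose edges are evenly spaced on a logarithmic axis. The sketch below is one way to build it; the name `log_binned_histogram`, the number of bins per decade, and the covered range (1 s to 10^7 s) are assumptions chosen for illustration.

```python
import math


def log_binned_histogram(iats, bins_per_decade=5, min_exp=0, max_exp=7):
    """Count IATs (in seconds) in bins whose edges are evenly spaced in log10."""
    n_bins = (max_exp - min_exp) * bins_per_decade
    counts = [0] * n_bins
    for iat in iats:
        if iat <= 0:
            continue  # the logarithm is undefined for zero or negative gaps
        idx = math.floor((math.log10(iat) - min_exp) * bins_per_decade)
        if 0 <= idx < n_bins:  # values outside the covered range are ignored
            counts[idx] += 1
    return counts
```

The function returns raw bin counts; the figure's y-axis is labeled PDF, so the plotted values are presumably normalized.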
7. Pattern Mining
Pattern 3: Periodic spikes in the IAT distribution
Caused by daily sleeping intervals
[Figure: log-binned histograms of posting IAT for Reddit users; x-axis: ∆, IAT (seconds); y-axis: PDF; spikes marked at 7h, 12h, 24h, 48h and 72h]
8. Pattern Mining
Pattern 4: Consecutive IAT are correlated
Long/short IAT are likely to be followed by long/short IAT
[Figure: heat-map of pairs of consecutive IAT, all Reddit users]
Concentration of pairs in the diagonal: positive correlation
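One simple way to quantify this pattern is the Pearson correlation between the logarithms of consecutive IAT. This is an illustrative measure, not necessarily the one used in the paper, and the name `lag1_log_correlation` is an assumption.

```python
import math


def lag1_log_correlation(iats):
    """Pearson correlation between log(previous IAT) and log(next IAT)."""
    if len(iats) < 3 or any(d <= 0 for d in iats):
        raise ValueError("need at least three strictly positive IATs")
    logs = [math.log(d) for d in iats]
    xs, ys = logs[:-1], logs[1:]  # pairs of consecutive IATs
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if all IATs are identical (zero variance).
    return cov / math.sqrt(var_x * var_y)
```

A value close to +1 corresponds to pairs concentrated along the diagonal of the heat-map.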
10. RSC Model
Can we generate synthetic time-stamps that match
real data patterns?
Pattern               Poisson  Queue  CNPP  SFP  RSC
Heavy Tails           -        ✔      -     ✔    ✔
Bimodal Distribution  -        -      ✔     -    ✔
Periodic Spikes       -        -      -     -    ✔
IAT Correlation       -        -      -     ✔    ✔
Models: Poisson Process; Queue-Based (Barabási, 2005); CNPP (Malmgren et al., 2009); SFP (Vaz de Melo et al., 2013); RSC (proposed model)
Proposed Model: Rest-Sleep-and-Comment
11. RSC Model
Base model: Self-Correlated Process (SCorr)
Definition: A stochastic process is a SCorr process with base rate λ and correlation ρ if:
∆_i ~ Exp(ρ ∆_{i-1} + 1/λ)
where X ~ Exp(1/λ) is an exponential random variable with rate λ
Consecutive IAT are correlated:
The i-th IAT ∆_i depends on the previous, (i-1)-th, IAT ∆_{i-1}
ρ controls correlation strength:
If ρ = 0, SCorr reduces to an exponential distribution
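The SCorr definition can be turned into a small generator. The sketch below reads Exp(β) as an exponential distribution with mean β, which is what makes Exp(1/λ) have rate λ. The function name `scorr_iats` and the convention that the first IAT is drawn with mean 1/λ are assumptions.

```python
import random


def scorr_iats(n, lam, rho, seed=None):
    """Generate n IATs with ∆_i ~ Exp(mean = rho * ∆_{i-1} + 1/lam)."""
    rng = random.Random(seed)
    iats = []
    prev = 0.0  # assumption: with no previous IAT, the first mean is 1/lam
    for _ in range(n):
        mean = rho * prev + 1.0 / lam
        # random.expovariate takes a rate, i.e. 1 / mean.
        prev = rng.expovariate(1.0 / mean)
        iats.append(prev)
    return iats
```

With rho = 0 every draw has mean 1/lam, so the generator produces plain exponential IAT, as stated on the slide. With rho > 0 a long IAT raises the mean of the next one, which produces the positive correlation.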
12. RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: consecutive IAT distribution, SCorr (synthetic data), λ = 20h, ρ = 0.7]
13. RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT CCDF, Reddit data vs. SCorr, λ = 20h, ρ = 0.7]
14. RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT log-binned histogram, data vs. SCorr, λ = 20m, ρ = 1.0]
15. RSC Model
Model States
Active:
1. Wait δ ~ SCorr(λA, ρA)
2. Post with probability ppost
3. Transition
Rest:
1. Wait δ ~ SCorr(λR, ρR)
2. Transition
Base rates: λA > λR
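Average wait time for the active state is smaller than for the rest state
State Transitions
[Diagram: Active → Rest with probability pR, Active → Active with probability 1 − pR; Rest → Active with probability pA, Rest → Rest with probability 1 − pA]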
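The active and rest states can be simulated with a short loop. This is a rough sketch under stated assumptions, not the authors' implementation: the function name `rsc_active_rest` is hypothetical, each state keeps its own SCorr chain, and the simulation starts in the active state (the slide does not specify either choice).

```python
import random


def rsc_active_rest(n_posts, lam_a, rho_a, lam_r, rho_r,
                    p_post, p_r, p_a, seed=None):
    """Simulate posting time-stamps with active and rest states (no sleep)."""
    if not (0 < p_post <= 1 and 0 < p_a <= 1):
        raise ValueError("p_post and p_a must be in (0, 1] to terminate")
    rng = random.Random(seed)
    lam = {"active": lam_a, "rest": lam_r}
    rho = {"active": rho_a, "rest": rho_r}
    prev = {"active": 0.0, "rest": 0.0}  # assumption: one SCorr chain per state
    state, clock, posts = "active", 0.0, []
    while len(posts) < n_posts:
        # 1. Wait a SCorr-distributed time in the current state.
        mean = rho[state] * prev[state] + 1.0 / lam[state]
        wait = rng.expovariate(1.0 / mean)
        prev[state] = wait
        clock += wait
        if state == "active":
            # 2. Post with probability p_post.
            if rng.random() < p_post:
                posts.append(clock)
            # 3. Move to rest with probability p_r, else stay active.
            if rng.random() < p_r:
                state = "rest"
        elif rng.random() < p_a:
            # Rest state: only waits, then may move back to active.
            state = "active"
    return posts
```

Choosing lam_a larger than lam_r makes waits in the active state short and waits in the rest state long, which is what yields two modes in the IAT histogram.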
16. RSC Model
SCorr + Rest and Active States
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT log-binned histogram, data vs. synthetic]
17. RSC Model
Modeling periodic spikes: sleep state
Keep track of current time:
tclock variable, 0:00h < tclock < 23:59h
Update tclock after each wait time δ
Enter the sleep state if:
Current state = rest and (tclock < twake or tclock > tsleep)
In the sleep state:
1. Wait until next wake-up time, twake
2. Transition to rest state
[Diagram: 24-hour clock showing tclock, with tsleep and twake separating the Sleep and Awake periods]
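The sleep rule can be expressed with three small helpers. The sketch assumes the clock is measured in hours of the day and that twake is earlier in the day than tsleep (for example 7h and 23h). The helper names are hypothetical.

```python
DAY = 24.0  # hours in a day


def should_sleep(state, t_clock, t_wake, t_sleep):
    """Sleep only from the rest state, and only outside the awake window."""
    return state == "rest" and (t_clock < t_wake or t_clock > t_sleep)


def hours_until_wake(t_clock, t_wake):
    """Hours to wait from t_clock until the next occurrence of t_wake."""
    return (t_wake - t_clock) % DAY


def advance_clock(t_clock, wait_hours):
    """Update the time of day after waiting, wrapping around midnight."""
    return (t_clock + wait_hours) % DAY
```

Because the sleep wait always ends at the same wake-up time, the resulting IAT cluster around daily multiples, which is consistent with the periodic spikes seen in the data.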
18. RSC Model
Complete RSC Model
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✔ Periodic Spikes
Parameter estimation uses the Levenberg-Marquardt algorithm
[Figure: IAT log-binned histogram]
20. Bot Detection
Problem: Given labeled time-stamp data from a set of users {U1, U2, U3, …}, decide if an unknown user Ui is a human or a bot.
Solution: RSC-Spotter
Compare the user's IAT to synthetic IAT generated by the RSC model
If they are not similar, then the user is likely to be a bot
[Figure: sequence of time-stamps from a single user over 0 to 70 days. Is the user that produced the time-stamps a human or a bot?]
21. RSC-Spotter
Comparing Time-stamps
Estimate the RSC parameters
Time-stamps from all users
For each user:
1. Compute the IAT histogram
Using log-binned bins
2. Generate synthetic time-stamps using RSC
RSC can generate the same number of time-stamps as the user
3. Compare user and synthetic IAT histogram
D = Σi |ci − či| (dissimilarity), where ci are the bin counts of the user data and či are the bin counts of the synthetic data
Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity D
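The dissimilarity D is a sum of absolute differences between bin counts, which can be sketched directly. The function name `dissimilarity` is an assumption; both histograms must use the same bins.

```python
def dissimilarity(user_counts, synthetic_counts):
    """D = sum over bins of |c_i - č_i| for two histograms on the same bins."""
    if len(user_counts) != len(synthetic_counts):
        raise ValueError("histograms must use the same bins")
    return sum(abs(c - s) for c, s in zip(user_counts, synthetic_counts))


# Hypothetical bin counts for a user and for RSC-generated data.
print(dissimilarity([3, 0, 5], [1, 2, 5]))  # 4
```

Since RSC generates the same number of time-stamps as the user, the raw counts are directly comparable without normalization.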
23. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal
Spikes
IAT Correlation
Models: CNPP (Malmgren et al.), SFP (Vaz de Melo et al.), RSC (proposed model)
[Figure: model fits for Reddit users and Twitter users; CNPP fails to match the heavy tail]
24. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔
Spikes           ✗
IAT Correlation
[Figure: Reddit users, CNPP (Malmgren et al.) fit: two modes, no periodic spikes]
25. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗
Spikes           ✗     ✗
IAT Correlation
[Figure: Reddit users, SFP (Vaz de Melo et al.) fit: single mode, no periodic spikes]
26. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation
[Figure: RSC (proposed model) fits for Reddit users and Twitter users: two modes, periodic spikes]
27. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗
[Figure: Twitter data vs. CNPP (Malmgren et al.) fit: no IAT correlation]
28. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗     ✔
[Figure: Twitter data vs. SFP (Vaz de Melo et al.) fit: IAT correlation (but too strong!)]
29. Experiments: Can RSC Match Real Data?
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗     ✔    ✔
[Figure: Twitter data vs. RSC (proposed model) fit: IAT correlation]
31. Experiments: Can RSC-Spotter Detect Bots?
Methodology
Datasets
Users were manually labeled as bots or humans
Reddit: 1,963 humans, 37 bots
Twitter: 1,353 humans, 64 bots
Training
Same size for train and test subsets (preserved class distribution)
Baseline features:
1. IAT Histogram: log-binned IAT histogram
2. Entropy: entropy of the IAT histogram
3. Week Hist.: number of postings for each day of the week
4. All features: combination of 1, 2 and 3
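As an illustration of baseline feature 2, the entropy of an IAT histogram can be computed as below. The slide does not state the logarithm base; base 2 (bits) is an assumption, as is the function name `histogram_entropy`.

```python
import math


def histogram_entropy(counts):
    """Shannon entropy (in bits) of a histogram given as bin counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    # Empty bins contribute nothing to the entropy.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A user who always posts at the same interval concentrates the counts in one bin and gets an entropy of zero, while spread-out IAT give higher values.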
32. Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
Twitter Dataset
Precision > 94%
Sensitivity > 70%
With strongly imbalanced datasets: # humans >> # bots
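For reference, precision and sensitivity (recall) with bots as the positive class can be computed as in this sketch. The function name and the boolean encoding are assumptions.

```python
def precision_and_sensitivity(labels, predictions):
    """Both inputs are lists of booleans; True means 'bot' (positive class)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # bots flagged as bots
    fp = sum(1 for y, p in pairs if not y and p)    # humans flagged as bots
    fn = sum(1 for y, p in pairs if y and not p)    # bots that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity
```

Neither measure counts true negatives, so both stay informative when humans greatly outnumber bots.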
33. Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
Reddit Dataset
Precision > 96%
Sensitivity > 47%
With strongly imbalanced datasets: # humans >> # bots
35. Conclusion
Pattern Mining
Discovered four activity patterns
RSC-Model
Model that matches the posting IAT distribution of social media users
RSC-Spotter
Can tell if a user is a bot based only on time-stamp data
[Figure: log-binned IAT histogram; x-axis: ∆, IAT (seconds); y-axis: PDF]
36. Thank you!
Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos
Universidade de São Paulo
*alceufc@icmc.usp.br
Datasets and Code: https://github.com/alceufc/rsc_model
38. RSC Spotter – Training
Goal: decide if a dissimilarity D is big enough to say that a user
is a bot
Input: training set of labeled users
Positive examples: bots
Negative examples: humans
1. Estimate pbot = P(user is a bot | D)
Naive-Bayes classifier
Dissimilarity D is a feature
2. Estimate a probability threshold pthresh
Cost sensitive classification
Minimize the weighted harmonic mean between FP and FN errors
Uses only training data
Assign costs to False Positive and False Negative errors
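The threshold step can be sketched as a search over candidate thresholds on the training data. Note that this sketch swaps in a simpler objective: it minimizes a weighted sum of false positives and false negatives, not the weighted harmonic mean named on the slide. The function name `pick_threshold` and the cost parameters are assumptions.

```python
def pick_threshold(p_bot, is_bot, cost_fp=1.0, cost_fn=1.0):
    """Return the candidate threshold with the lowest weighted error cost.

    p_bot:  estimated P(user is a bot | D) for each training user
    is_bot: True for bots (positive class), False for humans
    """
    best_thresh, best_cost = None, float("inf")
    for thresh in sorted(set(p_bot)):
        # A user is classified as a bot when p_bot >= thresh.
        fp = sum(1 for p, y in zip(p_bot, is_bot) if p >= thresh and not y)
        fn = sum(1 for p, y in zip(p_bot, is_bot) if p < thresh and y)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:  # ties keep the lowest threshold
            best_thresh, best_cost = thresh, cost
    return best_thresh
```

Raising cost_fp pushes the threshold up, so that fewer humans are flagged as bots at the price of missing more bots.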
39. Self-Correlated Process (SCorr)
Exponential distribution:
∆_i ~ Exp(β)
β: mean inter-arrival time
PDF: f(x) = (1/β) e^(−x/β), for x ≥ 0
Self-Correlated Process:
Similar to the exponential distribution…
…however β depends on the previous IAT:
β_i = ρ ∆_{i-1} + 1/λ
When users use Web sites like Twitter or Reddit, they post content such as photos, comments, or tweets.
All these postings are annotated with time-stamps.
So, each user generates a sequence of time-stamps when they use a social media Web site.
For example, we have here the timeline of posting time-stamps from two Twitter users.
- Each bar is a tweet, and the time unit is days.
What can we say just by looking at these time-stamps?
- Are there patterns that are common to all these users?
- Can we tell if they are from a real user or from a bot?
- Is it possible to mimic the time-stamps of a user?
Obs.: maybe we can close by showing the bot scores for the users from the first slide:
Show a bot and a regular user here (without the photo) and ask: can we tell which behavior is normal and which one is not?
Final slide => show the scores for the users and reveal their photos
In this work we analyzed data from two services: Reddit and Twitter.
For the Reddit dataset we have time-stamp sequences from over 20k users, and for the Twitter dataset we have time-stamp sequences from over 6k users.
Each user had a sequence of at least 900 time-stamps.
For the Twitter dataset the time-stamps were from tweets.
For the Reddit dataset the time-stamps were from user comments.
From each sequence of time-stamps we also computed the inter-arrival times (IAT) between postings.
The first pattern discovered from the datasets is the heavy-tailed distribution of inter-arrival times.
The plots in this slide show the tail of the IAT distribution for Reddit and Twitter users on log-log axes.
The second pattern we discovered is that the distribution of inter-arrival times is bimodal.
The two figures at the bottom of the slide show the log-binned histogram of inter-arrival times of all Reddit and Twitter users.
We have a first mode at around 6 min and a second mode at around the 2 h mark.
This can be explained by users having:
- Highly active sections, where they make more than one posting in a short interval of time
- Resting periods (e.g. working or doing some other activity)
Another pattern we discovered in our data is that consecutive IAT are correlated.
For example, if a user takes a long time to post a tweet, then it is more likely that she will take a long time to post her next tweet.
The figure to the right shows a heat-map of pairs of consecutive IAT.
There is a concentration of pairs in the diagonal of the plot, which indicates a positive correlation.
I will start this next part of the presentation with the following question:
Can we generate synthetic time-stamps that mimic human behavior?
There are many mathematical models for human dynamics:
- The Poisson Process is not able to match any of the patterns that we found.
- Queue-based models, such as the one proposed by Barabási, are able to generate power-law distributions and match the heavy-tail pattern.
- CNPP (Cascading Non-homogeneous Poisson Process) is able to match the bimodal distribution.
- The SFP process, proposed by Vaz de Melo et al., matches both the heavy tails and the correlation between consecutive IAT.
However, no existing model is able to match all the patterns.
We solve this problem by proposing the Rest-Sleep-and-Comment model, or RSC model, which is able to match all of them.
The base of RSC is a stochastic process that we propose, called the Self-Correlated Process (SCorr), which we use to generate IAT.
The green box shows the equation for the IAT generated by SCorr.
The SCorr process has two parameters: the base rate lambda and the correlation rho.
In SCorr, IAT are exponentially distributed; however, the rate of the exponential distribution depends on the following factors:
- The previous IAT, which makes consecutive IAT correlated.
- The rho parameter, which controls the strength of the correlation. For instance, if rho is equal to zero, SCorr reduces to an exponential distribution.
Before I show the complete RSC model, this slide shows the distribution of consecutive IAT for the SCorr process.
SCorr is able to generate correlated consecutive IAT (notice the concentration along the diagonal of the heat-map).
When we look at the CCDF of the IAT we can also see that SCorr is able to generate heavy tails.
In this slide we are comparing the CCDF of synthetic SCorr time-stamps to real data from Twitter users.
However, when we look at the histogram of IAT for the SCorr, it is possible to see that it does not match the bimodal distribution and the periodic spikes.
Now we improve the model by adding an active state and a rest state.
In the active state, RSC will:
- wait a time delta generated using SCorr
- make a posting with probability p_post
- make a transition to the rest state with probability p_R
The rest state only contributes to increasing the IAT. In the rest state, RSC will:
- wait a time delta generated using SCorr
- make a transition to the active state with probability p_A.
The important thing about the states is that the SCorr parameters are selected so that the base rate lambda_A for the active state is higher than the base rate lambda_R for the rest state:
the average wait time for the active state is smaller than the average wait time for the rest state.
With the active and rest states we can now model the bimodal pattern.
Show that each mode corresponds to a state.
However, this version of the model is not able to match the periodic spikes.
Why should we compare?
We can show here that bots have strange IAT histograms:
- No heavy tails
- Many spikes (posting at regular intervals, e.g. every 10 min)
In order to generate synthetic time-stamps that mimic real user behavior, we estimate the RSC parameters using data from all users.
The next step consists of generating a log-binned histogram of IAT for each user.
Now suppose that we have 1,000 time-stamps for this particular user.
We can use RSC to generate exactly 1,000 time-stamps.
Finally, we generate a histogram of synthetic IAT and compute the distance, that is, the difference, between the two histograms.
We will use this table to summarize the comparison of the models.
The figure shows the distribution of the dissimilarity D computed from users labeled as humans and bots from the Twitter dataset.
Most of the users with higher dissimilarity values are bots.
Now we need to decide whether a given value of dissimilarity is big enough to say that the user is a bot.
Given a training set of users labeled either as bots or humans