RSC: Mining and Modeling Temporal
Activity in Social Media
Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina
Caetano Traina Jr. Christos Faloutsos
Universidade
de São Paulo
KDD 2015 – Sydney, Australia
*alceufc@icmc.usp.br
Introduction
Users generate sequences of time-stamps when
they use a social media Web site
What can we learn from time-stamps?
Are there common patterns?
Can we tell if a user is a bot or a human?
[Figure: sequence of tweet time-stamps; each bar is one tweet time-stamp]
Outline
Pattern Mining
What patterns can we discover from temporal
activities of social media users?
Modeling
Bot Detection
Experiments
Conclusion
Pattern Mining: Datasets
Reddit Dataset
Time-stamps from comments
21,198 users
20 million time-stamps
Twitter Dataset
Time-stamps from tweets
6,790 users
16 million time-stamps
For each user we have:
Sequence of posting time-stamps: T = (t1, t2, t3, …)
Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …), where ∆i = ti+1 − ti
[Figure: timeline with time-stamps t1, t2, t3, t4 and the gaps ∆1, ∆2, ∆3 between them]
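The IAT sequence defined on this slide can be computed directly from a user's time-stamps. The following is a minimal sketch; the function name and the sample values are illustrative assumptions, not taken from the datasets.

```python
def inter_arrival_times(timestamps):
    """Return the IAT sequence (Δ1, Δ2, ...) for a list of posting time-stamps."""
    ts = sorted(timestamps)  # assumption: the input may arrive unordered
    # Each IAT is the gap between two consecutive time-stamps.
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


# Hypothetical time-stamps in seconds (illustrative values only).
T = [0, 60, 150, 10950]
print(inter_arrival_times(T))
```

A user with n time-stamps yields n − 1 inter-arrival times, so a single posting produces an empty sequence.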
Pattern Mining
Pattern 1: Distribution of IAT is heavy-tailed
Users can be inactive for long periods of time before making
new postings
[Figure: IAT Complementary Cumulative Distribution Function (CCDF), log-log axes; panels: Reddit users, Twitter users]
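An empirical CCDF like the one plotted for Pattern 1 could be computed as sketched below. The function name is hypothetical, and the sketch assumes the IAT values are already available as a list of numbers.

```python
def ccdf(iats):
    """Empirical CCDF: list of (x, P(IAT > x)), one point per distinct x."""
    xs = sorted(iats)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        # Emit a point only at the last occurrence of each distinct value,
        # so that the fraction counts every sample strictly greater than x.
        if i + 1 == n or xs[i + 1] != x:
            points.append((x, (n - (i + 1)) / n))
    return points


print(ccdf([1, 1, 2, 4]))
```

On log-log axes the final point has probability zero and would have to be dropped before plotting. A heavy tail shows up as a curve that decays much more slowly than an exponential.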
Pattern Mining
Pattern 2: Bimodal IAT distribution
Users have highly active sections and resting periods
[Figure: log-binned histograms of postings IAT; x-axis: ∆, IAT (seconds), 10^2 to 10^6; y-axis: PDF; panel: Twitter users; annotated modes: 1st mode (1min), 2nd mode (3h)]
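A log-binned histogram of IAT can be built by spacing the bin edges evenly in log10(∆). The sketch below is one possible binning; the number of bins per decade, the upper limit of 10^7 seconds, and the handling of sub-second IAT are assumptions, not values from the slides.

```python
import math


def log_binned_histogram(iats, bins_per_decade=5, max_exp=7):
    """Count IATs (in seconds) in bins with logarithmically spaced edges.

    Bin k covers [10**(k / bins_per_decade), 10**((k + 1) / bins_per_decade)).
    """
    n_bins = bins_per_decade * max_exp
    counts = [0] * n_bins
    for d in iats:
        if d < 1:
            continue  # assumption: sub-second IATs are ignored
        k = int(math.floor(math.log10(d) * bins_per_decade))
        counts[min(k, n_bins - 1)] += 1  # clamp very long IATs into the last bin
    return counts


counts = log_binned_histogram([5, 60, 60, 10800])
print([(k, c) for k, c in enumerate(counts) if c > 0])
```

With this binning, bursts of postings minutes apart and gaps of a few hours fall into well-separated bins, which is what makes the two modes visible.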
Pattern Mining
Pattern 3: Periodic spikes in the IAT distribution
Caused by daily sleeping intervals
[Figure: IAT histogram for Reddit users; x-axis: ∆, IAT (seconds); y-axis: PDF; marked IAT values: 7h, 12h, 24h, 48h, 72h]
Pattern Mining
Pattern 4: Consecutive IAT are correlated
Long/short IAT are likely to be followed by long/short IAT
[Figure: heat-map of pairs of consecutive IAT, all Reddit users; pairs concentrate on the diagonal, indicating positive correlation]
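One way to quantify Pattern 4 is the Pearson correlation between each IAT and the one that follows it. The sketch below works on log(∆), which is an assumption made here because IAT span several orders of magnitude; the slides show the pattern only as a heat-map, and the function name is hypothetical.

```python
import math


def consecutive_iat_correlation(iats):
    """Pearson correlation between log(Δ_{i-1}) and log(Δ_i)."""
    logs = [math.log(d) for d in iats if d > 0]
    x, y = logs[:-1], logs[1:]  # pairs of consecutive IATs
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Steadily growing gaps are perfectly correlated; alternating gaps are anti-correlated.
print(consecutive_iat_correlation([1, 2, 4, 8, 16]))
print(consecutive_iat_correlation([1, 100, 1, 100, 1]))
```

A value near zero would be expected for a process with independent IAT, such as a Poisson process.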
Outline
Pattern Mining
Modeling
Can we model the patterns?
Bot Detection
Experiments
Conclusion
RSC Model
Proposed Model: Rest-Sleep-and-Comment
Can we generate synthetic time-stamps that match real data patterns?
Models compared: Poisson Process; Queue Based (Barabási, 2005); CNPP (Malmgren et al., 2009); SFP (Vaz de Melo et al., 2013); RSC (proposed model)
Pattern               | Poisson | Queue | CNPP | SFP | RSC
Heavy Tails           |         |   ✔   |      |  ✔  |  ✔
Bimodal Distribution  |         |       |  ✔   |     |  ✔
Periodic Spikes       |         |       |      |     |  ✔
IAT Correlation       |         |       |      |  ✔  |  ✔
RSC Model
Base model: Self-Correlated Process (SCorr)
Definition: A stochastic process is a SCorr process with
base rate λ and correlation ρ if:
Consecutive IAT are correlated:
The i-th IAT ∆i depends on the previous IAT ∆i-1:
∆i ~ Exp(ρ∆i-1 + 1/λ)
Here Exp(β) denotes an exponential random variable with mean β, so X ~ Exp(1/λ) has rate λ
ρ controls correlation strength:
If ρ = 0, the IAT are i.i.d. exponential with rate λ (a Poisson process)
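The SCorr definition can be turned into a short generator. This is a sketch under stated assumptions: Exp(·) is read as parameterized by its mean (consistent with "X ~ Exp(1/λ)" having rate λ), the first IAT is drawn as if the previous IAT were zero, and the function name and `seed` argument are hypothetical.

```python
import random


def scorr(n, lam, rho, seed=None):
    """Generate n IATs from a SCorr process with base rate lam and correlation rho."""
    rng = random.Random(seed)
    iats = []
    prev = 0.0  # assumption: the first IAT is drawn with mean 1/lam
    for _ in range(n):
        mean = rho * prev + 1.0 / lam       # mean grows with the previous IAT
        prev = rng.expovariate(1.0 / mean)  # expovariate takes a rate, i.e. 1/mean
        iats.append(prev)
    return iats


sample = scorr(5, lam=1.0, rho=0.7, seed=42)
print(sample)
```

Because a long IAT raises the mean of the next draw, long gaps tend to follow long gaps. With `rho=0` every draw has mean 1/lam and the generator produces an ordinary Poisson process.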
RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: consecutive IAT distribution, SCorr synthetic data, λ = 20h, ρ = 0.7]
RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT CCDF, Reddit data vs. SCorr, λ = 20h, ρ = 0.7]
RSC Model
SCorr Process
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT log-binned histogram, data vs. SCorr, λ = 20m, ρ = 1.0]
RSC Model
Model States
Active:
1. Wait δ ~ SCorr(λA, ρA)
2. Post with probability ppost
3. Transition
Rest:
1. Wait δ ~ SCorr(λR, ρR)
2. Transition
Base rates: λA > λR
The average wait time in the active state is shorter than in the rest state
State Transitions
[Figure: two-state diagram (Active, Rest) with transition probabilities pA, 1-pA, pR, 1-pR]
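The active and rest states can be sketched as a small state machine. Several details here are assumptions rather than the authors' specification: pA and pR are read as the probabilities of staying in the active and rest states, each state keeps its own previous wait for SCorr, the process starts in the active state, and all names are hypothetical.

```python
import random


def generate_active_rest(n_posts, lam_a, rho_a, lam_r, rho_r,
                         p_post, p_a, p_r, seed=None):
    """Generate n_posts posting time-stamps from a two-state (active/rest) process."""
    rng = random.Random(seed)
    rate = {"active": lam_a, "rest": lam_r}
    rho = {"active": rho_a, "rest": rho_r}
    stay = {"active": p_a, "rest": p_r}  # assumption: self-transition probabilities
    prev = {"active": 0.0, "rest": 0.0}  # assumption: one SCorr memory per state
    state, t, posts = "active", 0.0, []
    while len(posts) < n_posts:
        # 1. Wait δ ~ SCorr(λ, ρ) for the current state.
        mean = rho[state] * prev[state] + 1.0 / rate[state]
        delta = rng.expovariate(1.0 / mean)
        prev[state] = delta
        t += delta
        # 2. Only the active state can emit a posting.
        if state == "active" and rng.random() < p_post:
            posts.append(t)
        # 3. Transition: stay with probability stay[state], otherwise switch.
        if rng.random() >= stay[state]:
            state = "rest" if state == "active" else "active"
    return posts


posts = generate_active_rest(10, lam_a=1 / 60, rho_a=0.5, lam_r=1 / 3600,
                             rho_r=0.5, p_post=0.8, p_a=0.7, p_r=0.3, seed=0)
print(posts)
```

Short waits in the active state and long waits in the rest state are what would produce two modes in the IAT histogram.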
RSC Model
SCorr + Rest and Active States
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✗ Periodic Spikes
[Figure: IAT log-binned histogram, data vs. synthetic]
RSC Model
Modeling periodic spikes: sleep state
Keep track of current time:
tclock variable, 0:00h ≤ tclock ≤ 23:59h
Update tclock after each wait time δ
Enter the sleep state if:
Current state = rest and
(tclock < twake or tclock > tsleep)
In the sleep state:
1. Wait until next wake-up time, twake
2. Transition to rest state
[Figure: 24-hour clock showing tclock, with the awake interval from twake to tsleep and the sleep interval outside it]
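The sleep rule can be expressed as a function that returns how long the process should wait before waking. This is a sketch, assuming times are measured in seconds, that twake < tsleep within the same day, and that the caller invokes it only while in the rest state; the function name is hypothetical.

```python
DAY = 24 * 3600  # seconds in one day


def seconds_until_wake(t, t_wake, t_sleep):
    """Return the sleep duration at absolute time t, or 0 if the user is awake.

    t is in seconds since midnight of day 0; t_wake and t_sleep are times of
    day in seconds, with t_wake < t_sleep (assumption).
    """
    t_clock = t % DAY  # current time of day
    if t_clock < t_wake:
        return t_wake - t_clock        # early morning: wake up later today
    if t_clock > t_sleep:
        return DAY - t_clock + t_wake  # late night: wake up tomorrow
    return 0


H = 3600
print(seconds_until_wake(3 * H, t_wake=7 * H, t_sleep=23 * H) / H)
```

Because every sleep ends at the same time of day, gaps that include a sleep cluster around similar durations on consecutive days, which is one way to account for the periodic spikes.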
RSC Model
Complete RSC Model
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✔ Periodic Spikes
Parameter estimation uses the Levenberg-Marquardt algorithm
[Figure: IAT log-binned histogram]
Outline
Pattern Mining
Modeling
Bot Detection
Can we spot automated behavior based only on time-stamp data?
Experiments
Conclusion
Bot Detection
Problem: Given labeled time-stamp data from a set of users {U1, U2, U3, …}, decide if an unknown user Ui is a human or a bot.
Solution: RSC-Spotter
Compare the user's IAT to synthetic IAT generated by the RSC model
If they are not similar, the user is likely to be a bot
[Figure: sequence of time-stamps from a single user; x-axis: time (days), 0 to 70. Was the user that produced the time-stamps a human or a bot?]
RSC-Spotter
Comparing Time-stamps
Estimate the RSC parameters
Using time-stamps from all users
For each user:
1. Compute the IAT histogram
Using log-spaced bins
2. Generate synthetic time-stamps using RSC
RSC can generate the same number of time-stamps as the user posted
3. Compare the user and synthetic IAT histograms
D = Σi |ci – či| (dissimilarity)
where ci are the bin counts of the user data and či the bin counts of the synthetic data
Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity D
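The dissimilarity D = Σi |ci – či| is the sum of absolute differences between two histograms that share the same bins. A minimal sketch follows; the function name and the sample counts are illustrative assumptions.

```python
def dissimilarity(user_counts, synth_counts):
    """D = sum over bins of |c_i - č_i| for two histograms with identical bins."""
    if len(user_counts) != len(synth_counts):
        raise ValueError("histograms must use the same bins")
    return sum(abs(c - s) for c, s in zip(user_counts, synth_counts))


# Hypothetical bin counts for a user and for RSC-generated synthetic data.
print(dissimilarity([3, 0, 5], [1, 2, 5]))
```

Since the synthetic sequence has the same number of time-stamps as the user's, the two histograms have equal totals and D measures only the difference in shape.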
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Can RSC match real data?
How well can RSC-Spotter detect bots?
Conclusion
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |      |     |
Spikes           |      |     |
IAT Correlation  |      |     |
CNPP: Malmgren et al.; SFP: Vaz de Melo et al.; RSC: proposed model
[Figure: model fits for Reddit users and Twitter users; CNPP fails to match the heavy tail]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |     |
Spikes           |  ✗   |     |
IAT Correlation  |      |     |
[Figure: CNPP (Malmgren et al.) fit for Reddit users: two modes, no periodic spikes]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |  ✗  |
Spikes           |  ✗   |  ✗  |
IAT Correlation  |      |     |
[Figure: SFP (Vaz de Melo et al.) fit for Reddit users: single mode, no periodic spikes]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |  ✗  |  ✔
Spikes           |  ✗   |  ✗  |  ✔
IAT Correlation  |      |     |
[Figure: RSC (proposed model) fit for Reddit users: two modes, periodic spikes; panel labels: Reddit Users, Twitter Users]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |  ✗  |  ✔
Spikes           |  ✗   |  ✗  |  ✔
IAT Correlation  |  ✗   |     |
[Figure: Twitter data vs. CNPP (Malmgren et al.) fit: no IAT correlation]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |  ✗  |  ✔
Spikes           |  ✗   |  ✗  |  ✔
IAT Correlation  |  ✗   |  ✔  |
[Figure: Twitter data vs. SFP (Vaz de Melo et al.) fit: IAT correlation (but too strong!)]
Experiments: Can RSC Match Real Data?
Pattern          | CNPP | SFP | RSC
Heavy Tail       |  ✗   |  ✔  |  ✔
Bimodal          |  ✔   |  ✗  |  ✔
Spikes           |  ✗   |  ✗  |  ✔
IAT Correlation  |  ✗   |  ✔  |  ✔
[Figure: Twitter data vs. RSC (proposed model) fit: IAT correlation]
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Can RSC Match Real Data?
How well can RSC-Spotter detect bots?
Conclusion
Experiments: Can RSC-Spotter Detect Bots?
Methodology
Datasets
Users were manually labeled as bots or humans
Reddit: 1,963 humans, 37 bots
Twitter: 1,353 humans, 64 bots
Training
Same size for train and test subsets (preserved class distribution)
Baseline features:
1. IAT Histogram: log-binned IAT histogram
2. Entropy: entropy of the IAT histogram
3. Week Hist.: number of postings for each day of the week
4. All features: combination of 1, 2 and 3
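Two of the baseline features, the entropy of the IAT histogram and the day-of-week histogram, can be sketched as follows. The function names are hypothetical, the entropy is computed in bits, and time-stamps are assumed to be Unix seconds interpreted in UTC; the slides do not specify these details.

```python
import math
from datetime import datetime, timezone


def histogram_entropy(counts):
    """Shannon entropy (in bits) of a histogram given as a list of bin counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # empty bins contribute nothing
    return -sum(p * math.log2(p) for p in probs)


def weekday_histogram(timestamps):
    """Number of postings per day of the week (index 0 = Monday)."""
    counts = [0] * 7
    for ts in timestamps:
        # assumption: Unix time-stamps in seconds, interpreted in UTC
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).weekday()] += 1
    return counts


print(histogram_entropy([1, 1, 1, 1]))
print(weekday_histogram([0, 86400, 7 * 86400]))
```

A posting schedule with very regular intervals concentrates the IAT in few bins and yields low entropy, which is the intuition behind using entropy as a bot indicator.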
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
Twitter Dataset
Precision > 94%
Sensitivity > 70%
With strongly imbalanced datasets (# humans >> # bots)
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
Reddit Dataset
Precision > 96%
Sensitivity > 47%
With strongly imbalanced datasets (# humans >> # bots)
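The two metrics on these curves follow from the counts of true positives, false positives and false negatives, with bots as the positive class. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def precision_sensitivity(y_true, y_pred):
    """Return (precision, sensitivity); truthy labels mark the positive (bot) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0    # flagged users that are bots
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # bots that were flagged
    return precision, sensitivity


# Hypothetical labels: 3 bots among 8 users, 3 users flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_sensitivity(y_true, y_pred))
```

Neither metric counts true negatives, so both remain informative when humans vastly outnumber bots.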
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Conclusion
Conclusion
Pattern Mining
Discovered four activity patterns
RSC-Model
Model that matches the postings IAT distribution of social media users
RSC-Spotter
Can tell if a user is a bot based only on time-stamp data
[Figure: IAT histogram; x-axis: ∆, IAT (seconds); y-axis: PDF]
Thank you!
Datasets and Code: https://github.com/alceufc/rsc_model
Extra Slides
RSC Spotter – Training
Goal: decide if a dissimilarity D is big enough to say that a user is a bot
Input: training set of labeled users
Positive examples: bots
Negative examples: humans
1. Estimate pbot = P(user is a bot | D)
Naive-Bayes classifier
Dissimilarity D is a feature
2. Estimate a probability threshold pthresh
Cost-sensitive classification
Assign costs to False Positive and False Negative errors
Minimize the weighted harmonic mean between FP and FN errors
Uses only training data
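Step 2 can be illustrated with a simple threshold search over the training data. Note that this sketch swaps in a plain weighted sum of false positives and false negatives as the cost, not the weighted harmonic mean named on the slide, and it takes the pbot values as given instead of fitting the Naive-Bayes classifier. The function name and the cost parameters are hypothetical.

```python
def pick_threshold(p_bot, is_bot, cost_fp=1.0, cost_fn=1.0):
    """Return the threshold on p_bot with the lowest weighted error cost.

    A user is predicted to be a bot when p_bot >= threshold.
    Simplification: the cost is cost_fp * FP + cost_fn * FN.
    """
    best_thresh, best_cost = None, float("inf")
    for thresh in sorted(set(p_bot)):  # candidate thresholds from training data only
        fp = sum(1 for p, b in zip(p_bot, is_bot) if p >= thresh and not b)
        fn = sum(1 for p, b in zip(p_bot, is_bot) if p < thresh and b)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_thresh, best_cost = thresh, cost
    return best_thresh


# Hypothetical training probabilities and labels.
p = [0.2, 0.4, 0.6, 0.8]
labels = [False, True, False, True]
print(pick_threshold(p, labels, cost_fp=1.0, cost_fn=10.0))
```

Raising `cost_fn` relative to `cost_fp` pushes the threshold down, so more users are flagged and fewer bots are missed.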
Self-Correlated Process (SCorr)
Exponential distribution:
∆i ~ Exp(β)
PDF: f(x) = (1/β) e^(-x/β)
β: mean inter-arrival time
Self-Correlated Process:
Similar to the exponential distribution…
…however β depends on the previous IAT:
βi = ρ∆i-1 + 1/λ
RSC: Time-stamp Generation
RSC: Complete State Machine

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

RSC: Mining and Modeling Temporal Activity in Social Media

  • 1. RSC: Mining and Modeling Temporal Activity in Social Media. Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos. Universidade de São Paulo. KDD 2015 – Sydney, Australia. *alceufc@icmc.usp.br
  • 2. Introduction. Users generate sequences of time-stamps when they use a social media Web site. What can we learn from time-stamps? Are there common patterns? Can we tell if a user is a bot or a human? [Figure: sequences of tweet time-stamps; each bar is a tweet.]
  • 3. Outline: Pattern Mining (What patterns can we discover from the temporal activities of social media users?); Modeling; Bot Detection; Experiments; Conclusion.
  • 4. Pattern Mining: Datasets. Reddit dataset: time-stamps from comments, 21,198 users, 20 million time-stamps. Twitter dataset: time-stamps from tweets, 6,790 users, 16 million time-stamps. For each user we have the sequence of posting time-stamps $T = (t_1, t_2, t_3, \dots)$ and the inter-arrival times (IAT) of postings $(\Delta_1, \Delta_2, \Delta_3, \dots)$. [Figure: time axis with time-stamps $t_1, \dots, t_4$ and the gaps $\Delta_1, \Delta_2, \Delta_3$ between consecutive time-stamps.]
  • 5. Pattern Mining. Pattern 1: the distribution of IAT is heavy-tailed. Users can be inactive for long periods of time before making new postings. [Figure: IAT complementary cumulative distribution function (CCDF) on log-log axes, for Reddit users and Twitter users.]
  • 6. Pattern Mining. Pattern 2: bimodal IAT distribution. Users have highly active sections and resting periods. [Figure: log-binned histogram of posting IAT for Twitter users; x-axis: ∆, IAT (seconds); y-axis: PDF; 1st mode (1min), 2nd mode (3h).]
  • 7. Pattern Mining. Pattern 3: periodic spikes in the IAT distribution, caused by daily sleeping intervals. [Figure: log-binned histograms of posting IAT for Reddit users; x-axis: ∆, IAT (seconds); y-axis: PDF; marks at 7h, 12h, 24h, 48h and 72h.]
  • 8. Pattern Mining. Pattern 4: consecutive IAT are correlated. Long IAT are likely to be followed by long IAT, and short IAT by short IAT. [Figure: heat-map of pairs of consecutive IAT for all Reddit users; the concentration of pairs along the diagonal indicates positive correlation.]
  • 9. Outline: Pattern Mining; Modeling (Can we model the patterns?); Bot Detection; Experiments; Conclusion.
  • 10. RSC Model. Can we generate synthetic time-stamps that match real data patterns? Patterns matched by each model: Poisson Process: none. Queue Based (Barabási, 2005): heavy tails. CNPP (Malmgren et al., 2009): bimodal distribution. SFP (Vaz de Melo et al., 2013): heavy tails and IAT correlation. RSC (proposed model): heavy tails, bimodal distribution, periodic spikes and IAT correlation. Proposed model: Rest-Sleep-and-Comment (RSC).
  • 11. RSC Model. Base model: Self-Correlated Process (SCorr). Definition: a stochastic process is a SCorr process with base rate λ and correlation ρ if $\Delta_i \sim \mathrm{Exp}(\rho\,\Delta_{i-1} + 1/\lambda)$, where $X \sim \mathrm{Exp}(1/\lambda)$ denotes an exponential random variable with rate λ, that is, with mean 1/λ. Consecutive IAT are correlated: the i-th IAT $\Delta_i$ depends on the previous, (i-1)-th IAT $\Delta_{i-1}$. ρ controls the correlation strength: if ρ = 0, the IAT are independent exponential variables with rate λ, so SCorr reduces to a Poisson process.
  • 12. RSC Model: SCorr Process. ✔ Correlated IAT, ✔ Heavy Tail, ✗ Bimodal Distribution, ✗ Periodic Spikes. [Figure: distribution of consecutive IAT for SCorr synthetic data, λ = 20h, ρ = 0.7.]
  • 13. RSC Model: SCorr Process. ✔ Correlated IAT, ✔ Heavy Tail, ✗ Bimodal Distribution, ✗ Periodic Spikes. [Figure: IAT CCDF of Reddit data compared with SCorr, λ = 20h, ρ = 0.7.]
  • 14. RSC Model: SCorr Process. ✔ Correlated IAT, ✔ Heavy Tail, ✗ Bimodal Distribution, ✗ Periodic Spikes. [Figure: log-binned IAT histogram of the data compared with SCorr, λ = 20m, ρ = 1.0.]
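  • 15. RSC Model: Model States. Active: (1) wait $\delta \sim \mathrm{SCorr}(\lambda_A, \rho_A)$; (2) post with probability $p_{post}$; (3) transition. Rest: (1) wait $\delta \sim \mathrm{SCorr}(\lambda_R, \rho_R)$; (2) transition. Base rates: $\lambda_A > \lambda_R$, so the average wait time is smaller in the active state than in the rest state. State transitions: from the active state, move to rest with probability $p_R$ and stay active with probability $1 - p_R$; from the rest state, move to active with probability $p_A$ and stay at rest with probability $1 - p_A$.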
  • 16. RSC Model: SCorr + Rest and Active States. ✔ Heavy Tail, ✔ Correlated IAT, ✔ Bimodal Distribution, ✗ Periodic Spikes. [Figure: log-binned IAT histogram of the data compared with synthetic data.]
  • 17. RSC Model: modeling periodic spikes with a sleep state. Keep track of the current time of day in a variable $t_{clock}$, 0:00h < $t_{clock}$ < 23:59h, and update $t_{clock}$ after each wait time δ. Enter the sleep state if the current state is rest and ($t_{clock} < t_{wake}$ or $t_{clock} > t_{sleep}$). In the sleep state: (1) wait until the next wake-up time, $t_{wake}$; (2) transition to the rest state.
  • 18. RSC Model: Complete RSC Model. ✔ Heavy Tail, ✔ Correlated IAT, ✔ Bimodal Distribution, ✔ Periodic Spikes. Parameter estimation uses the Levenberg-Marquardt algorithm. [Figure: log-binned IAT histogram.]
  • 19. Outline: Pattern Mining; Modeling; Bot Detection (Can we spot automated behavior based only on time-stamp data?); Experiments; Conclusion.
  • 20. Bot Detection. Problem: given labeled time-stamp data from a set of users {U1, U2, U3, …}, decide if an unknown user Ui is a human or a bot. Solution: RSC-Spotter. Compare the user's IAT to synthetic IAT generated by the RSC model; if they are not similar, then the user is likely to be a bot. [Figure: sequence of time-stamps from a single user; x-axis: time (days), 0 to 70. Is the user that produced the time-stamps a human or a bot?]
  • 21. RSC-Spotter: Comparing Time-stamps. First, estimate the RSC parameters using time-stamps from all users. Then, for each user: (1) compute the IAT histogram using logarithmic bins; (2) generate synthetic time-stamps using RSC, which can generate the same number of time-stamps as the user; (3) compare the user and synthetic IAT histograms with the dissimilarity $D = \sum_i |c_i - \check{c}_i|$, where $c_i$ are the IAT bin counts of the user data and $\check{c}_i$ are the IAT bin counts of the synthetic data. Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity D.
  • 22. Outline: Pattern Mining; Modeling; Bot Detection; Experiments (Can RSC match real data? How well can RSC-Spotter detect bots?); Conclusion.
  • 23. Experiments: Can RSC Match Real Data? Models compared: RSC (proposed model), CNPP (Malmgren et al.) and SFP (Vaz de Melo et al.), on Reddit users and Twitter users. CNPP fails to match the heavy tail. Heavy Tail: CNPP ✗, SFP ✔, RSC ✔.
  • 24. Experiments: Can RSC Match Real Data? CNPP (Malmgren et al.) on Reddit users: two modes, no periodic spikes. Bimodal: CNPP ✔. Spikes: CNPP ✗.
  • 25. Experiments: Can RSC Match Real Data? SFP (Vaz de Melo et al.) on Reddit users: single mode, no periodic spikes. Bimodal: SFP ✗. Spikes: SFP ✗.
  • 26. Experiments: Can RSC Match Real Data? RSC (proposed model) on Reddit users and Twitter users: two modes, periodic spikes. Bimodal: RSC ✔. Spikes: RSC ✔.
  • 27. Experiments: Can RSC Match Real Data? CNPP (Malmgren et al.) fit to Twitter data: no IAT correlation. IAT Correlation: CNPP ✗.
  • 28. Experiments: Can RSC Match Real Data? SFP (Vaz de Melo et al.) fit to Twitter data: IAT correlation (but too strong!). IAT Correlation: SFP ✔.
  • 29. Experiments: Can RSC Match Real Data? RSC (proposed model) fit to Twitter data: IAT correlation. IAT Correlation: RSC ✔. Completed table: CNPP: Heavy Tail ✗, Bimodal ✔, Spikes ✗, IAT Correlation ✗. SFP: Heavy Tail ✔, Bimodal ✗, Spikes ✗, IAT Correlation ✔. RSC: Heavy Tail ✔, Bimodal ✔, Spikes ✔, IAT Correlation ✔.
  • 30. Outline: Pattern Mining; Modeling; Bot Detection; Experiments (Can RSC match real data? How well can RSC-Spotter detect bots?); Conclusion.
  • 31. Experiments: Can RSC-Spotter Detect Bots? Methodology. Datasets: users were manually labeled as bots or humans. Reddit: 1,963 humans and 37 bots. Twitter: 1,353 humans and 64 bots. Training: same size for train and test subsets (preserved class distribution). Baseline features: (1) IAT Histogram: log-binned IAT histogram; (2) Entropy: entropy of the IAT histogram; (3) Week Hist.: number of postings for each day of the week; (4) All features: combination of 1, 2 and 3.
  • 32. Experiments: Can RSC-Spotter Detect Bots? Precision vs. sensitivity curves for the Twitter dataset (good performance: curve close to the top). Precision > 94% and sensitivity > 70%, with a strongly imbalanced dataset (# humans >> # bots).
  • 33. Experiments: Can RSC-Spotter Detect Bots? Precision vs. sensitivity curves for the Reddit dataset (good performance: curve close to the top). Precision > 96% and sensitivity > 47%, with a strongly imbalanced dataset (# humans >> # bots).
  • 35. Conclusion. Pattern Mining: discovered four activity patterns. RSC-Model: a model that matches the posting IAT distribution of social media users. RSC-Spotter: can tell if a user is a bot based only on time-stamp data.
  • 36. Thank you! Alceu F. Costa*, Yuto Yamaguchi, Agma J. M. Traina, Caetano Traina Jr., Christos Faloutsos. Universidade de São Paulo. *alceufc@icmc.usp.br Datasets and code: https://github.com/alceufc/rsc_model
  • 38. RSC-Spotter: Training. Goal: decide if a dissimilarity D is big enough to say that a user is a bot. Input: training set of labeled users (positive examples: bots; negative examples: humans). (1) Estimate $p_{bot} = P(\text{user is a bot} \mid D)$ with a Naive Bayes classifier that uses the dissimilarity D as a feature. (2) Estimate a probability threshold $p_{thresh}$ by cost-sensitive classification: assign costs to False Positive (FP) and False Negative (FN) errors and minimize the weighted harmonic mean between FP and FN errors, using only training data.
  • 39. Self-Correlated Process (SCorr). Exponential distribution: $\Delta_i \sim \mathrm{Exp}(\beta)$, with PDF $f(x) = \frac{1}{\beta} e^{-x/\beta}$, where β is the mean inter-arrival time. Self-Correlated Process: similar to the exponential distribution, except that β depends on the previous IAT: $\beta_i = \rho\,\Delta_{i-1} + 1/\lambda$.
  • 41. RSC: Complete State Machine
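
The active, rest and sleep states described on slides 15 and 17, driven by the SCorr waits of slides 11 and 39, can be put together in a few lines. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation (their code is linked on slide 36). The names `scorr_wait` and `simulate_rsc` are hypothetical, time is measured in seconds, each state is assumed to remember its own previous wait, and the sleep check is assumed to happen just before a rest wait. Parameter values are left to the caller; nothing here is estimated from data.

```python
import random

DAY = 24 * 3600.0  # seconds per day


def scorr_wait(prev_wait, lam, rho, rng):
    """Draw one SCorr wait: exponential with mean rho * prev_wait + 1 / lam."""
    mean = rho * prev_wait + 1.0 / lam
    return rng.expovariate(1.0 / mean)  # expovariate takes a rate, i.e. 1 / mean


def simulate_rsc(n_posts, lam_a, rho_a, lam_r, rho_r,
                 p_post, p_a, p_r, t_wake, t_sleep, seed=0):
    """Generate n_posts posting time-stamps (seconds) from the three-state machine.

    Assumes 0 <= t_wake < t_sleep < DAY, lam_a > 0, lam_r > 0, p_post > 0, p_a > 0.
    """
    rng = random.Random(seed)
    t = t_wake                           # assumption: start at wake-up time on day 0
    state = "active"
    prev = {"active": 0.0, "rest": 0.0}  # assumption: one SCorr memory per state
    posts = []
    while len(posts) < n_posts:
        if state == "active":
            wait = scorr_wait(prev["active"], lam_a, rho_a, rng)
            prev["active"] = wait
            t += wait
            if rng.random() < p_post:    # post with probability p_post
                posts.append(t)
            if rng.random() < p_r:       # active -> rest with probability p_R
                state = "rest"
        else:
            clock = t % DAY              # time of day, the t_clock of slide 17
            if clock < t_wake or clock > t_sleep:
                # Sleep state: wait until the next wake-up time, then rest.
                t += (t_wake - clock) % DAY
            wait = scorr_wait(prev["rest"], lam_r, rho_r, rng)
            prev["rest"] = wait
            t += wait
            if rng.random() < p_a:       # rest -> active with probability p_A
                state = "active"
    return posts
```

Calling `simulate_rsc` with a large `lam_a` and a small `lam_r` gives short waits inside active bursts and long waits between them, which is how the two modes of the IAT histogram arise in the slides; the jump to the next `t_wake` is what would produce the daily spikes.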

Editor's Notes

  1. When users use Web sites like Twitter or Reddit, they post content such as photos, comments, or tweets. All these postings are annotated with time-stamps, so each user generates a sequence of time-stamps when they use a social media Web site. For example, we have here the time-lines of posting time-stamps from two Twitter users. Each bar is a tweet and the time unit is one day. What can we say just by looking at these time-stamps? Are there patterns that are common to all these users? Can we tell if they are from a real user or from a bot? Is it possible to mimic the time-stamps from a user? Obs.: maybe we can close by showing the bot scores for the users from the first slide: show a bot and a regular user here (without the photo) and ask whether we can tell which behavior is normal and which one is not. Final slide: show the scores for the users and reveal their photos.
  2. In this work we analyzed data from two services: Reddit and Twitter. For the Reddit dataset we have time-stamp sequences from over 20k users, and for the Twitter dataset we have time-stamp sequences from over 6k users. Each user had a sequence of at least 900 time-stamps. For the Twitter dataset the time-stamps were from tweets. For the Reddit dataset the time-stamps were from user comments. From each sequence of time-stamps we also computed the IAT (inter-arrival times) between postings.
  3. The first pattern discovered from the datasets is the heavy-tailed distribution of inter-arrival times. The plots in this slide show the tail part of the IAT distribution for Reddit and Twitter users on log-log axes.
  4. The second pattern we discovered is that the distribution of inter-arrival times is bimodal. The two figures at the bottom of the slide show the log-binned histograms of inter-arrival times of all Reddit and Twitter users. We have a first mode at around 6min and a second mode at the 2h mark. This can be explained by users having (a) highly active sections, where they make more than one posting in a short interval of time, and (b) resting periods (e.g. working or doing some other activity).
  5. Another pattern we discovered in our data is that consecutive IAT are correlated. For example, if a user takes a long time to post a tweet, then it is more likely that she will take a long time to post her next tweet. The figure to the right shows a heat-map of pairs of consecutive IAT. There is a concentration of pairs in the diagonal of the plot, which indicates a positive correlation.
  6. I will start this next part of the presentation with the following question: can we generate synthetic time-stamps that mimic human behavior? There are many mathematical models for human dynamics. The Poisson Process is not able to match any of the patterns that we found. The Queue Based model, such as the one proposed by Barabási, is able to generate power-law distributions and matches the heavy-tail pattern. CNPP (Cascading Non-homogeneous Poisson Process) is able to match the bimodal distribution. The SFP process matches both the heavy tails and the correlation between consecutive IAT. However, none of these models is able to match all the patterns. We solve this problem by proposing the Rest-Sleep-and-Comment model, or RSC model.
  7. We call our model Rest-Sleep-and-Comment (RSC). The base of RSC is a stochastic process that we proposed, called the Self-Correlated Process (SCorr), which we use to generate IAT. The green box shows the equation for the IATs generated by SCorr. The SCorr process has two parameters, the base rate lambda and the correlation rho, and is described by the equation shown in the slide. In SCorr, IATs are exponentially distributed; however, the rate of the exponential distribution depends on the following factors: the previous IAT, which makes consecutive IAT correlated, and the rho parameter, which controls the strength of the correlation. For instance, if rho is equal to zero, SCorr reduces to an exponential distribution.
  8. Before I show the complete RSC model, this slide shows the distribution of consecutive IAT for the SCorr Process. The SCorr is able to generate correlated consecutive IAT (notice the concentration along the diagonal of the heat-map).
  9. When we look at the CCDF of the IAT we can also see that SCorr is able to generate heavy tails. In this slide we are comparing the CCDF of synthetic SCorr time-stamps to real data from Twitter users.
  10. However, when we look at the histogram of IAT for the SCorr, it is possible to see that it does not match the bimodal distribution and the periodic spikes.
  11. Now we improve the RSC model by adding an active and a rest state. In the active state, RSC will: wait a time delta generated using SCorr; make a posting with probability p_post; make a transition to the rest state with probability p_R. The rest state only contributes to increasing the IAT; in it, RSC will: wait a time delta generated using SCorr; make a transition to the active state with probability p_A. The base rate lambda_A for the active state is higher than lambda_R, so the average wait time for the active state is smaller than the average wait time for the rest state.
  12. With the active and rest states we can now model the bimodal pattern. Show that each mode corresponds to a state. However, this version of the model is not able to match the periodic spikes.
  13. Show that each mode corresponds to a state.
  14. Why should we compare? We can show here that bots have strange IAT histograms. - No heavy tails - Many spikes (posting at regular intervals, e.g. every 10min)
  15. In order to generate synthetic time-stamps that mimic real user behavior, we estimate the RSC parameters using data from all users. The next step consists of generating a log-binned histogram of IAT for each user. Now suppose that we have 1,000 time-stamps for this particular user. We can use RSC to generate exactly 1,000 time-stamps. Finally, we generate a histogram of the synthetic IAT and compute the distance, that is, the difference between the two histograms.
  16. We will use this table to summarize the comparison of the models.
  17. The figure shows the distribution of the dissimilarity D computed from users labeled as humans and bots from the Twitter dataset. Most of the users with higher dissimilarity values are bots. Now we need to decide whether a given value of dissimilarity is big enough to say that the user is a bot. Given a training set of users labeled either as bots or humans
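
The comparison described in notes 15 and 17 (log-binned IAT histogram for the user, the same for synthetic data, then the difference between the two) can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function names, the number of bins, the bin range of 1 second to 10^7 seconds, and the clamping of out-of-range gaps into the edge bins are all assumptions made here. Raw counts are compared directly because, as stated on slide 21, RSC can generate the same number of time-stamps as the user.

```python
import math


def inter_arrival_times(timestamps):
    """Gaps between consecutive time-stamps, after sorting them."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def log_binned_counts(iats, n_bins=20, lo=1.0, hi=1e7):
    """Count IATs in n_bins logarithmically spaced bins over [lo, hi] seconds."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    width = (log_hi - log_lo) / n_bins
    counts = [0] * n_bins
    for d in iats:
        d = min(max(d, lo), hi)  # clamp, so zero or huge gaps land in the edge bins
        idx = int((math.log10(d) - log_lo) / width)
        counts[min(idx, n_bins - 1)] += 1
    return counts


def dissimilarity(user_counts, synth_counts):
    """D = sum_i |c_i - c'_i| over matching histogram bins."""
    return sum(abs(c - s) for c, s in zip(user_counts, synth_counts))
```

A hypothetical account that posts exactly every 10 minutes would put all of its IATs into a single bin, which is far from any histogram with a heavy tail and two modes, so its D would be large. Turning D into a bot or human decision is the training step of slide 38 and is not covered by this sketch.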