The slides of our talk at the ECMLPKDD 2017 conference.
Abstract. In this paper we propose a survival factorization framework that models information cascades by tying together social influence patterns, topical structure and temporal dynamics. This is achieved through the introduction of a latent space which encodes: (a) the relevance of an information cascade on a topic; (b) the topical authoritativeness and the susceptibility of each individual involved in the information cascade, and (c) temporal topical patterns. By exploiting the cumulative properties of the survival function and of the likelihood of the model on a given adoption log, which records the observed activation times of users and side-information for each cascade, we show that the inference phase is linear in the number of users and in the number of adoptions. The evaluation on both synthetic and real-world data shows the effectiveness of the model in detecting the interplay between topics and social influence patterns, which ultimately provides high accuracy in predicting users' activation times.
The paper is available at
http://ecmlpkdd2017.ijs.si/papers/paperID392.pdf
or alternatively at
https://www.dropbox.com/s/hd3r3z3gwqlx9gp/pkdd17.pdf?dl=0
Code is also available at
https://github.com/gmanco/SurvivalFactorization
1. Survival Factorization on Diffusion Networks
Nicola Barbieri, Giuseppe Manco and Ettore Ritacco
Tumblr, 35 E 21st St, 10010, New York, USA - nicola@tumblr.com
ICAR - CNR, via Bucci 7/11C, 87036 Arcavacata di Rende (CS), ITALY - giuseppe.manco@icar.cnr.it, ettore.ritacco@icar.cnr.it
2. Context
• Users can create content
• Content can be shared within a diffusion network
• The diffusion takes place within cascades
• Trees of timed word-of-mouth chains
8. Formally
• $\mathbf{t}_c = (t_1(c), \ldots, t_N(c))$ is a cascade with content $c$
• $N$ is the number of users
• $t_u(c) \in [0, T_c] \cup \{\infty\}$ is the timestamp at which node $u$ becomes active in the cascade $\mathbf{t}_c$
• $T_c$ is the time horizon
• The probability of user $u$ being infected by user $v$ at time $t_u(c)$ is given by:
$f(t_u(c) \mid t_v(c), \lambda_{u,v}) \propto e^{-\lambda_{u,v}(t_u(c) - t_v(c))}$
9. The Infection model
• $v$ is the influencer
• $\lambda_{u,v}$ represents the influence exerted by $v$ on $u$
• The transmission rate
• $S(t_u(c) \mid t_v(c), \lambda_{u,v}) = e^{-\lambda_{u,v}(t_u(c) - t_v(c))}$ is the survival function,
• the probability of resisting the contagion $p(T \geq t_u(c) \mid t_v(c))$
• $t_u(c) - t_v(c)$ represents the exposure time
• The longer the delay, the lower the probability of infection
$f(t_u(c) \mid t_v(c), \lambda_{u,v}) = \lambda_{u,v} \cdot e^{-\lambda_{u,v}(t_u(c) - t_v(c))}$
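The exponential infection model on this slide can be sketched in a few lines. This is only an illustration of the two formulas; the function names `survival` and `density` are hypothetical and are not taken from the authors' code.

```python
import math

def survival(t_u, t_v, rate):
    """S(t_u | t_v, rate): probability that u resists v's contagion up to t_u."""
    return math.exp(-rate * (t_u - t_v))

def density(t_u, t_v, rate):
    """f(t_u | t_v, rate) = rate * S(t_u | t_v, rate)."""
    return rate * survival(t_u, t_v, rate)

# With a fixed rate, a longer exposure time gives a lower infection density.
short_delay = density(2.0, 1.0, 0.5)
long_delay = density(3.0, 1.0, 0.5)
```

The ratio `density / survival` is the constant `rate`, which is why the exponential model treats the transmission rate as a time-independent hazard.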
11. Building the Survival Model – Issues
• The nature of the transmission rate 𝝀 determines the adaptiveness of
the model to the personalization of the contagion
• A very fine-grain approach:
• a single value $\lambda_{u,v}$ for each pair of users within each cascade
• This approach is intractable in real scenarios
• The matrix $\mathbf{\Lambda}$, containing all the $\lambda_{u,v}$, has size $N^2$
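To make the size issue concrete, the sketch below compares the number of parameters of a full pairwise rate matrix with that of a factorized rate. It assumes a product form in which the rate on topic k is the susceptibility of u times the authoritativeness of v; that form, and the names `S`, `A` and `rate`, are illustrative assumptions, since these slides do not spell out the factorization.

```python
def pairwise_params(n_users):
    """One rate per ordered pair of users: N^2 values."""
    return n_users * n_users

def factorized_params(n_users, n_topics):
    """One susceptibility and one authoritativeness per user and topic: 2*N*K values."""
    return 2 * n_users * n_topics

def rate(S, A, u, v, k):
    """Assumed product form: influence of v on u under topic k."""
    return S[u][k] * A[v][k]

n_users, n_topics = 1_000_000, 10
full = pairwise_params(n_users)                    # 10^12 values
compact = factorized_params(n_users, n_topics)     # 2 * 10^7 values
```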
18. The complete model
• Content can be modeled jointly
• E.g., textual content modeled by a mixture of Poisson distributions expressing
topic dependency
• For each cascade $c \in \{1, \ldots, M\}$
o Sample the topical diffusion pattern, $z_c \sim \mathrm{Multinomial}(\Theta)$
o For each word $w$ in $c$
§ Sample the occurrences of $w$ in $c$, $n_{w,c} \sim \mathrm{Poisson}(\Phi)$
o For each user $u$ in $c$
§ Sample the user who generated the contagion, $y_u^c \sim \mathrm{Multinomial}(\Xi)$
§ Sample her activation time, $t_u(c) \sim \mathrm{Weibull}(z_c, y_u^c, A, S)$
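The generative process above can be mimicked with a toy sampler. This is a rough sketch under several assumptions that the slides do not confirm: the influencer is picked uniformly among already-active users (standing in for the Multinomial over Ξ), users are visited in index order, the rate is the product `S[u][z] * A[v][z]`, and the Weibull scale is `1 / rate`. All function and parameter names are hypothetical.

```python
import math
import random

def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, count, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1

def sample_cascade(rng, theta, phi, S, A, seed_user, horizon, shape=1.0):
    """Draw one toy cascade: topic, word counts, activation times, influencers."""
    z = rng.choices(range(len(theta)), weights=theta)[0]        # topic of the cascade
    words = [sample_poisson(rng, phi[w][z]) for w in range(len(phi))]
    times = {seed_user: 0.0}                                    # the seed starts the cascade
    influencer = {}
    for u in range(len(S)):
        if u == seed_user:
            continue
        v = rng.choice(list(times))                             # assumed: uniform over active users
        rate = S[u][z] * A[v][z]                                # assumed product form
        delay = rng.weibullvariate(1.0 / rate, shape)           # shape=1.0 is the exponential case
        if times[v] + delay <= horizon:                         # later activations stay unobserved
            times[u] = times[v] + delay
            influencer[u] = v
    return z, words, times, influencer

rng = random.Random(0)
theta = [0.5, 0.5]
phi = [[1.0, 2.0], [0.5, 0.1]]            # phi[w][z]: mean count of word w under topic z
S = [[1.0, 1.0], [0.5, 2.0], [2.0, 0.5]]  # susceptibility per user and topic
A = [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5]]  # authoritativeness per user and topic
z, words, times, influencer = sample_cascade(rng, theta, phi, S, A, seed_user=0, horizon=5.0)
```

Users missing from `times` correspond to an activation time of infinity, meaning they were not infected within the horizon.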
19. Model Learning
• EM approach
• E-step:
• Update latent variables
• M-step:
• Given the status of the latent variables 𝒁 and 𝒀, update parameters
• Linear complexity!
• The update equations in the EM algorithm can be optimized by exploiting the
factorization of $\lambda_{u,v}^{k}$
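One way to see where the linear complexity can come from is the sketch below. Assuming an exponential model, a single fixed topic, and a product-form rate `S[u] * A[v]` (all assumptions, and not the authors' actual update equations), the log-survival term summed over all earlier-activated users can be computed with running sums instead of a nested loop.

```python
def log_survival_naive(times, S, A):
    """O(n^2): for each user u, loop over every user v activated earlier."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    total = 0.0
    for pos, u in enumerate(order):
        for v in order[:pos]:
            total -= S[u] * A[v] * (times[u] - times[v])
    return total

def log_survival_linear(times, S, A):
    """Single pass after sorting, using running sums of A[v] and A[v] * t_v."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    cum_a, cum_at, total = 0.0, 0.0, 0.0
    for u in order:
        # sum_v A[v] * (t_u - t_v) = t_u * sum_v A[v] - sum_v A[v] * t_v
        total -= S[u] * (times[u] * cum_a - cum_at)
        cum_a += A[u]
        cum_at += A[u] * times[u]
    return total

times = [0.0, 1.5, 0.7, 3.0]
S = [0.2, 1.0, 0.5, 0.8]
A = [1.0, 0.3, 0.6, 0.9]
naive = log_survival_naive(times, S, A)
fast = log_survival_linear(times, S, A)
```

Both functions return the same value; the second one touches each adoption once after the sort, which is the kind of cumulative bookkeeping that a factorized rate makes possible.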
21. Exploiting the model
• We started with four questions:
• Q: Who will share a piece of content?
• A: Users infected within a given time horizon
• Q: When will someone share a piece of content?
• A: A sample from $p(\mathbf{t}_c \mid \mathbf{Z}, \mathbf{Y}, \mathbf{\Lambda})$
• Q: Who is an expert on a topic characterizing a set of contents?
• A: Influential users, see $A_{v,k}$
• Q: Who is interested in a topic?
• A: Susceptible users, see $S_{u,k}$
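As a rough illustration of how learned parameters could answer these questions, the sketch below ranks users by their per-topic scores and computes the chance of activation within a horizon under the exponential model. The matrices and the helper names `top_users` and `activation_probability` are made up for the example.

```python
import math

def top_users(scores, k, n=3):
    """Indices of the n users with the largest score on topic k."""
    ranked = sorted(range(len(scores)), key=lambda u: scores[u][k], reverse=True)
    return ranked[:n]

def activation_probability(rate, horizon):
    """1 - S(horizon): chance of infection within the horizon, exponential case."""
    return 1.0 - math.exp(-rate * horizon)

# Hypothetical learned matrices, one row per user and one column per topic.
A = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]   # authoritativeness
S = [[0.3, 0.4], [0.6, 0.1], [0.2, 0.7]]   # susceptibility

experts_topic0 = top_users(A, k=0, n=2)     # most influential users on topic 0
interested_topic1 = top_users(S, k=1, n=1)  # most susceptible user on topic 1
p_share = activation_probability(rate=0.5, horizon=2.0)
```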
23. Evaluation
• Activation prediction:
• Two samples of Twitter (filtered/noisy draws)
• Testing protocol:
• Given an incomplete cascade (50%, 80%), fill in the missing activations
• Predict activation times
• Influencers and topics:
• MemeTracker dataset
• Testing protocol:
• A semantic (handmade) analysis on the top topics and most influential users
26. Conclusions
• Robust, efficient and accurate modeling of information cascades
• Factorizing the infection rate uncovers highly relevant
information concerning the underlying diffusion process
• Works with the general Weibull distribution, not just the exponential
• Future work
• Bayesian learning: The underlying probability distributions allow conjugate priors
• Exploit multiple mutual elicitation processes (e.g. Hawkes processes) within the same
model
• Deep architectures for combining heterogeneous content
• Content dynamics within a cascade