This thesis is one of the first investigations at the intersection of social influence propagation, viral marketing, and social advertising. Its objective is to take the algorithmic aspects of viral marketing out of the lab and extend them to account for real-world social advertisement models, drawing on the viral marketing literature to study social-influence-aware ad allocation for social advertising. To this end, we take a first step towards enabling online social influence analytics in support of viral marketing decision making, and propose an efficient influence indexing framework that can accurately answer topic-aware viral marketing queries with millisecond response times. We then initiate the investigation of social advertising through the viral marketing lens, aligned with real-world social advertisement models, and introduce two fundamental optimization problems regarding the allocation of ads to social network users under social influence. We devise greedy algorithms with provable approximation guarantees for these novel problems. We also develop scalable versions of our approximation algorithms by leveraging the notion of reverse reachability sampling on social graphs, and experimentally confirm that our algorithms are scalable and deliver high quality solutions.
Aslay Ph.D. Defense
1. From Viral Marketing to Social Advertising:
Ad Allocation Under Social Influence
Çiğdem Aslay (UPF)
Supervisors:
Prof. Dr. Ricardo Baeza-Yates (UPF)
Dr. Francesco Bonchi (ISI)
2. Outline
• Introduction
• Influence in Online Social Networks
• Viral Marketing and Influence Maximization
• Social Advertising: Promoted Posts
• Part I - Online Topic-aware Influence Maximization Queries
• Part II - Social Advertising: Regret Minimization
• Part III - Social Advertising: Revenue Maximization
• Conclusion
3. Influence in Online Social Networks
Grumpy Cat
• 25K+ votes in Reddit (< 1 day)
• 1M+ views in Imgur
• 300+ variants in Reddit
• 100+ Quickmeme macros
(< 2 days)
• Social Influence Induced Viral Phenomena
4. Influence in Online Social Networks
• Attached a promotional message with a clickable URL for free sign up
• Merely spent $50K
• 12M users signed up within the first 18 months
• Sign-up to the service only through invitation from a friend
• No money spent on marketing
• Resulted in bidding on eBay for invites
Viral Marketing*
Exploit the “word of mouth” effect in a social network to achieve marketing goals through self-replicating viral processes
* S. Jurvetson, “What Exactly is Viral Marketing”, Red Herring
5. Influence Maximization
Discrete Optimization Problem*
• Given
• a directed social network G = (V,E)
• a propagation model m
• a cardinality budget k
• Define
• S: initial set of k (seed) nodes to start the propagation
• σm(S): expected size of the influence propagation from S
• Find
$S^* = \arg\max_{S \subseteq V,\ |S| = k} \sigma_m(S)$
* Kempe et al., “Maximizing the spread of influence through a social network”, KDD 2003
6. Influence Propagation Models
Independent Cascade (IC) Model
• Each arc (u,v) is associated with an influence probability puv
• A node u activated at time t tries to influence each inactive neighbor v, with a
success probability puv
Topic-aware Independent Cascade (TIC) Model*
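• An item i is described as a distribution over K topics: $\gamma_i = (\gamma_i^1, \dots, \gamma_i^K)$ with $\sum_{z=1}^{K} \gamma_i^z = 1$
• Topic-specific influence probabilities on arcs: $p_{u,v}^z$
• Item-specific success probabilities on arcs: $p_{u,v}^i = \sum_{z=1}^{K} \gamma_i^z \, p_{u,v}^z$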
*N. Barbieri, F. Bonchi and G. Manco, “Topic-aware Social Influence Propagation Models”, ICDM 2012
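The two propagation models above can be illustrated with a minimal Python sketch. It assumes, as we understand the TIC model, that an item's arc probability is the topic-distribution-weighted mix of the topic-specific arc probabilities, after which propagation proceeds exactly as in IC. The function names, the toy graph, and all probability values are hypothetical and chosen only for illustration.

```python
import random

def item_probabilities(topic_probs, gamma):
    """Mix topic-specific arc probabilities into item-specific ones.

    topic_probs: {(u, v): [p_uv^1, ..., p_uv^K]} (one value per topic)
    gamma: the item's topic distribution [gamma^1, ..., gamma^K], summing to 1
    """
    return {arc: sum(g * p for g, p in zip(gamma, probs))
            for arc, probs in topic_probs.items()}

def simulate_ic(arc_probs, seeds, rng):
    """Run one Independent Cascade from `seeds`; return the activated set."""
    out_arcs = {}
    for (u, v), p in arc_probs.items():
        out_arcs.setdefault(u, []).append((v, p))
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in out_arcs.get(u, []):
                # A newly activated node gets one chance per inactive neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

# Toy graph with K = 2 topics (made-up values).
topic_probs = {("a", "b"): [0.8, 0.1], ("b", "c"): [0.2, 0.6]}
p_item = item_probabilities(topic_probs, gamma=[0.5, 0.5])
activated = simulate_ic(p_item, seeds=["a"], rng=random.Random(0))
```

Averaging `len(activated)` over many runs gives a Monte Carlo estimate of the expected spread for the item.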
7. Complexity and Approximation
• Influence Maximization is NP-Hard under both models
• TIC boils down to IC on the probabilistic graph Gi = (V,A,pi)
• Reduction from the Set Cover problem
• Greedy algorithm
• (1 – 1/e)-approximation* using monotonicity and submodularity of σm(S)
• Computing σm(S) exactly is #P-hard
*Nemhauser et al., “An analysis of approximations for maximizing submodular set functions I”, Mathematical Programming 1978
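As a rough illustration of the greedy algorithm, the sketch below adds one seed at a time, always choosing the node whose addition yields the largest estimated spread. Because exact spread computation is hard, it substitutes a plain Monte Carlo estimate under the IC model; the function names, the number of simulation runs, and the toy graph are assumptions made for this example only.

```python
import random

def simulate_ic(out_arcs, seeds, rng):
    """One Independent Cascade run; returns the set of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in out_arcs.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

def estimate_spread(out_arcs, seeds, runs, rng):
    """Monte Carlo estimate of the expected spread of `seeds`."""
    return sum(len(simulate_ic(out_arcs, seeds, rng)) for _ in range(runs)) / runs

def greedy_seeds(out_arcs, nodes, k, runs=100, seed=0):
    """Pick k seeds greedily.

    Maximizing the spread of chosen + [v] is the same as maximizing the
    marginal gain of v, since the spread of `chosen` is a common baseline.
    """
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        candidates = [v for v in nodes if v not in chosen]
        best = max(candidates,
                   key=lambda v: estimate_spread(out_arcs, chosen + [v], runs, rng))
        chosen.append(best)
    return chosen

# Toy graph with deterministic arcs (probability 1.0) so the outcome is stable.
toy_arcs = {"a": [("b", 1.0), ("c", 1.0)], "d": [("e", 1.0)]}
toy_seeds = greedy_seeds(toy_arcs, ["a", "b", "c", "d", "e"], k=2)
```

On the toy graph the first pick reaches three nodes and the second adds two more; with probabilistic arcs the estimates, and hence the picks, depend on `runs`.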
8. Social Advertising
A market that did not exist until Facebook launched its first advertising service in May 2005, projected to generate $11 billion revenue by 2017*
• Implemented by online social networking platforms
• “Promoted Posts” are injected into the social feeds of users
• Similar to organic posts from friends in a social network
• Contain an advertising message: text, image or video
• Can propagate to friends via social actions: “likes”, “shares”
• Each click on a promoted post produces social proof to friends
• Advertisers have to pay for engagements / clicks
* http://www.unified.com/historyofsocialadvertising/
9. Motivation
• Part I - Online Topic-aware Influence Maximization Queries
• Enable online social influence analytics in support of viral marketing decision making
• Part II - Social Advertising: Regret Minimization
• Part III - Social Advertising: Revenue Maximization
[Figure labels: Influence Maximization, Computational Advertising]
10. Part I
Online Topic-aware Influence
Maximization Queries
• C. Aslay, N. Barbieri, F. Bonchi, and R. Baeza-Yates. “Online
Topic-aware Influence Maximization Queries”. Published in
International Conference on Extending Database Technology
(EDBT) 2014.
11. Topic-aware Influence Maximization (TIM) Queries
• Given
• a social graph G = (V,E)
• a space of Z topics
• topic-specific peer-influence probabilities on arcs, $p_{u,v}^z$
• a query item q
• a cardinality budget k
• A TIM query asks to find a seed set of k nodes that maximizes the expected number of nodes adopting item q in the network: $S^* = \arg\max_{S \subseteq V,\ |S| = k} \sigma_q(S)$, where σq(S) denotes the expected spread of S for item q
12. Topic-aware Influence Maximization (TIM) Queries
• TIM query can be processed by any influence maximization algorithm:
• Reduce TIC to IC via the derived graph Gq = (V,A,pq)
• Enjoy (1 – 1/e)-approximation guarantee
• Efficiency compromised: takes days to process a single query for k = 50 on a graph with 30K nodes and 425K edges with CELF++*
• Challenge: enormous number of potential queries
• Any possible point lying on the probability simplex
• Any potential query induces a different probabilistic graph
• Indexing is necessary for online TIM query processing
• Need milliseconds response to enable online viral marketing analytics
*Goyal et al., “CELF++: optimizing the greedy algorithm for influence maximization in social networks”, WWW 2011
13. INFLEX: Influence Index
Index over pre-computed solutions of a limited number of TIM queries.
Similar items are likely to interest similar users
• Similar peer influence probabilities
• Similar influence propagation patterns
14. Index Construction (Offline)
• Phase 1: seed node extraction
• Phase 2: tree-based index construction
• Phase 3: list-based index construction
Query Processing (Online)
• Phase 1: topic-wise NNs retrieval
• Phase 2: aggregation of pre-computed seed sets of NNs w.r.t. topic-wise similarity
15. Selection of Index Items
• Space-based selection:
• Equi-distantly positioned topic distributions on the probability simplex
• (+) Fair coverage of the simplex
• (-) Disregards the available workload
• Data-driven selection:
• Catalog of items learnt from the log of past propagations
• (+) Queried items likely to follow the distributions learnt from past data
• (-) Sparsity issues for skewed topic distributions in the catalog
• Simplex sampling: the best of both approaches
16. Selection of Index Items
• Sampling from the probability simplex
• Estimate the Dirichlet distribution maximizing the log-likelihood of the
available workload
• Generate a large sample with good simplex coverage
• Bregman K-means++ clustering on the generated sample
• Take distributions on the centroids as the index items
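The index item selection can be sketched roughly as follows. This is a simplified illustration, not the INFLEX implementation: it assumes the Dirichlet parameters have already been fitted to the workload (the maximum-likelihood fit is omitted), uses random initialization in place of k-means++ seeding, and uses KL-divergence as the Bregman divergence, for which the cluster mean is the centroid. All names and parameter values are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the probability simplex from Dirichlet(alpha)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def kl(p, q, eps=1e-12):
    """KL-divergence KL(p || q), smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bregman_kmeans(points, k, iters, rng):
    """Lloyd-style clustering with KL-divergence as the distortion measure."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: kl(x, centroids[c]))
            clusters[nearest].append(x)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]  # assumed to come from a fit to the workload
sample = [sample_dirichlet(alpha, rng) for _ in range(200)]
index_items = bregman_kmeans(sample, k=4, iters=10, rng=rng)
```

Each centroid is itself a topic distribution, so it can serve directly as an index item for which a seed set is pre-computed.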
17. Tree Construction
• KL-Divergence for measuring similarity between probability distributions: non-metric search space!
Bregman Ball Trees1,2
• Hierarchical space partition based on convex Bregman Balls:
• Bregman k-means++ to generate child nodes from parent nodes
• Gaussian clustering to find the optimal number of child nodes (k in k-means)
1 Cayton, “Fast Nearest Neighbor Retrieval for Bregman Divergences”, ICML 2008
2 Nielsen et al., “Tailored Bregman Ball Trees for Effective Nearest Neighbors”, EuroCG 2009
18. Similarity Search
• DFS starting from the root node to the leaf nodes
• Navigation via projection of the query point onto Bregman balls
• Pruning strategy
• use an upper bound from current NN set:
• visit subtree only if it improves the current bound:
• Neither range nor k-NN search
• Anderson-Darling statistical test as stopping criterion
• if the leaves visited so far provide “good enough” neighbours, return
19. Rank Aggregation
• Combine the seed node rankings of NN’s into a “consensus” ranking
Kemeny-Optimal Rank Aggregation
• Find a ranked list that has the min. Kendall-Tau distance to the input lists
• Kendall-Tau distance: # of pairwise disagreements between 2 ranked lists
• NP-Hard even for 4 input permutations*
• Approximation via techniques from Social Choice Theory
*Dwork et al., “Rank aggregation methods for the web”, WWW 2001
20. INFLEX – Rank Aggregation
• Social Choice Theory strives for fairness
• Aggregation weights: non-linear transformation of KL-Divergence
Weighted Borda Aggregation
• Borda score: total # of list-elements preceded in all the input lists
• 5-approximation to the optimal Kemeny ranking
Weighted Copeland Aggregation
• Copeland score: total # of list-elements that were defeated in the
pairwise comparison among all the input lists
• 4-approximation to the optimal Kemeny ranking
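A small Python sketch may help make the Kendall-Tau distance and weighted Borda aggregation concrete. The weights are taken as given inputs here; how they are derived from KL-divergence is not reproduced, and the function names and toy rankings are hypothetical. The Kendall-Tau function assumes both lists rank the same items.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Number of item pairs that the two rankings order differently."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    return sum(1 for x, y in combinations(rank_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def weighted_borda(rankings, weights):
    """Consensus ranking by weighted Borda score.

    In each list an item earns one point per item ranked below it,
    scaled by that list's weight; ties are broken alphabetically.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for i, x in enumerate(ranking):
            scores[x] = scores.get(x, 0.0) + w * (n - 1 - i)
    return sorted(scores, key=lambda x: (-scores[x], x))

# Seed rankings of three nearest neighbours, with made-up weights.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
weights = [0.5, 0.3, 0.2]
consensus = weighted_borda(rankings, weights)
```

Here `consensus` is `["a", "b", "c"]`, which disagrees with the second input list on exactly one pair.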
21. Experiments
• Real-world FLIXSTER dataset
• Social graph: 30K users, 425k unidirectional social links
• Propagation Log {(User, Movie, Time)}
• Ratings on 12K movies
• Benchmarks devised via various INFLEX components
• exactKNN: exact K-NN search (with best performing K)
• approxKNN: approximate K-NN search (with best performing K)
• approxKNN + Sel: approximate K-NN search + automatic list selection
• approxAD: Anderson-Darling test based approximate NN search
• INFLEX: Anderson-Darling test based approximate NN search with automatic list selection
https://github.com/aslayci/INFLEX
22. Experiments
• Ground truth: standard (offline) greedy algorithm
23. Part II
Social Advertising:
Regret Minimization
• C. Aslay, W. Lu, F. Bonchi, A. Goyal, and L. V. Lakshmanan.
“Viral Marketing Meets Social Advertising: Ad Allocation with
Minimum Regret”. Published in International Conference on
Very Large Data Bases (VLDB) 2015.
24. Social Advertising
Cost per Engagement (CPE) Model
• The social network platform owner (a.k.a. host)
– Sells “ad-engagements” (“clicks”) to advertisers
– Inserts promoted posts into the social feed of users likely to click, i.e., users with high click-through-probability (CTP)
• Advertiser
– Willing to pay a fixed CPE to host for each click
Ad allocation under social influence
Strategically allocate users to advertisers, leveraging social influence and
the propensity of ads to propagate, subject to limited advertisers’ budgets
25. TIC-CTP Propagation Model
Extending TIC model with Click-Through-Probabilities
• Balance between intrinsic relevance in the absence of social proof and
peer influence
• Ad-specific CTP for each user: δ(u,i)
• Probability that user u will click ad i in the absence of social proof
• Lemma 4.1: TIC-CTP reduces to TIC model with $p_{H,u}^i = \delta(u,i)$
• When δ(u,i) = 1 for all u and i, TIC = TIC-CTP
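[Figure: a special node H with arcs to users u, v, w carrying probabilities pHu, pHv, pHw; arcs from u to v and w carrying puv, puw]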
26. Budget and Regret
• Host:
• Owns directed social graph G = (V,E) and TIC-CTP model instance
• Sets user attention bound κu for each user u ∊ V
• Advertiser i:
• agrees to pay CPE(i) for each click up to his budget Bi
• Exp. revenue of the host from allocating seed set Si to advertiser i:
min(σi(Si) × CPE(i), Bi)
• σi(Si) × CPE(i) < Bi : Lost revenue opportunity for the host
• σi(Si) × CPE(i) > Bi : Free service to the advertiser
• Both cases contribute to the host’s regret
27. Budget and Regret
(Raw) Allocation Regret
• Regret of the host from allocating seed set Si to advertiser i:
Ri(Si) = |Bi − σi(Si) × CPE(i)|
• Overall allocation regret:
$R(S_1, \dots, S_h) = \sum_{i=1}^{h} R_i(S_i)$
Penalized Allocation Regret
• λ: penalty to discourage selecting large number of poor quality seeds
• Regret of the host with seed set size penalization
Ri(Si) = |Bi − σi(Si) × CPE(i)| + λ × |Si|
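The revenue and regret definitions translate directly into a few lines of Python. In this sketch the expected spread is passed in as a plain number; in practice it would come from a propagation model. Function names and the numeric values are illustrative assumptions.

```python
def host_revenue(budget, cpe, spread):
    """Expected revenue from one advertiser: min(spread * CPE, budget)."""
    return min(spread * cpe, budget)

def allocation_regret(budget, cpe, spread, num_seeds=0, lam=0.0):
    """|B_i - spread * CPE(i)| + lambda * |S_i|; lam = 0 gives the raw regret."""
    return abs(budget - spread * cpe) + lam * num_seeds

def overall_regret(allocations, lam=0.0):
    """Sum of per-advertiser regrets over (budget, cpe, spread, num_seeds) tuples."""
    return sum(allocation_regret(b, c, s, n, lam) for b, c, s, n in allocations)

# Budget 100 and CPE 2: undershooting and overshooting by the same amount
# produce the same raw regret, but only overshooting caps the revenue.
under = allocation_regret(100, 2, 40)   # 80 in payments: lost revenue opportunity
over = allocation_regret(100, 2, 60)    # 120 in engagements: free service
total = overall_regret([(100, 2, 40, 4), (50, 1, 50, 2)], lam=0.5)
```

With these numbers both `under` and `over` equal 20, while the revenue is 80 in the first case and capped at 100 in the second.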
28. Regret Minimization
• Given
• a social graph G = (V,E)
• TIC-CTP propagation model
• h advertisers with budget Bi and CPE(i) for each advertiser i
• attention bound κu for each user u ∊ V
• penalty parameter λ ≥ 0
• Find a valid allocation S = (S1, …, Sh) that minimizes the overall regret of the host from the allocation: $R(S_1, \dots, S_h) = \sum_{i=1}^{h} R_i(S_i)$
29. Theoretical Analysis
• Regret-Minimization is NP-hard and is NP-hard to approximate
• Reduction from 3-PARTITION problem
• Regret function is neither monotone nor submodular
• Still, a greedy algorithm: selects the (ad, user) pair that gives the max. reduction in regret
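The greedy selection rule can be sketched as below. This is only an illustration under strong assumptions: the spread oracle is a made-up additive function (each user contributes a fixed number of clicks), whereas the actual problem uses expected spread under TIC-CTP, and the stopping rule (stop when no pair reduces regret) is our reading of the greedy step. All names are hypothetical.

```python
def greedy_allocation(users, ads, spread, attention, lam=0.0):
    """Greedy regret minimization sketch.

    ads: {ad: (budget, cpe)}
    spread(ad, seeds): expected number of clicks (oracle supplied by caller)
    attention: {user: max number of ads the user may be a seed for}
    """
    alloc = {ad: set() for ad in ads}
    load = {u: 0 for u in users}

    def regret(ad, seeds):
        budget, cpe = ads[ad]
        return abs(budget - spread(ad, seeds) * cpe) + lam * len(seeds)

    while True:
        best = None  # (regret reduction, user, ad)
        for ad in ads:
            current = regret(ad, alloc[ad])
            for u in users:
                if u in alloc[ad] or load[u] >= attention[u]:
                    continue  # already a seed for this ad, or attention exhausted
                drop = current - regret(ad, alloc[ad] | {u})
                if drop > 0 and (best is None or drop > best[0]):
                    best = (drop, u, ad)
        if best is None:
            return alloc  # no (ad, user) pair reduces the regret any further
        _, u, ad = best
        alloc[ad].add(u)
        load[u] += 1

# Toy instance with an additive, made-up spread oracle.
reach = {"u1": 30, "u2": 20, "u3": 10}
toy_spread = lambda ad, seeds: sum(reach[u] for u in seeds)
toy_ads = {"x": (50, 1.0), "y": (20, 1.0)}
toy_attention = {u: 1 for u in reach}
toy_alloc = greedy_allocation(list(reach), toy_ads, toy_spread, toy_attention)
```

On the toy instance ad `x` exhausts its budget exactly with `u1` and `u2`, and ad `y` is left with `u3`, for a remaining regret of 10.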
30. Theoretical Analysis
Approximation guarantee w.r.t. the total budget of all advertisers
• Theorem 4.2: Penalized allocation regret
• Raw allocation regret
• Theorem 4.3:
• Theorem 4.4:
31. Scalable Algorithms
Two-Phase Influence Maximization (TIM) Algorithm*
• Estimates influence spread for the most influential “s” nodes from a random sample of “θ(s)” RR-Sets
• θ(s): statistically sufficient sample size needed for accurate estimation of the influence spread of s nodes
• Estimator:
• TIM cannot be used for minimizing the regret
• Does not handle CTPs
• Requires predefined seed set size s
Two-Phase Iterative Regret Minimization (TIRM)
• Built on the Reverse Influence Sampling framework of TIM
* Tang et al., “Influence maximization: Near-optimal time complexity meets practical efficiency”, SIGMOD 2014
32. TIRM
(1) RR-sets sampling under TIC-CTP model: RRC-sets
• Sample a random RR set R for advertiser i
• Remove every node u in R with probability 1 – δ(u,i)
• Form “RRC-set” from the remaining nodes
Scalability compromised: requires at least 2 orders of magnitude bigger sample size for CTP = 0.01.
Theorem 4.5: MG(u | S) in IC-CTP = δ(u) * MG(u | S) in IC
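The RR-set and RRC-set sampling steps can be sketched as follows. The sketch assumes the standard reverse influence sampling estimator (number of nodes times the fraction of sampled sets hit by the seed set) and follows the three RRC-set steps listed above; the function names, graph, and CTP values are hypothetical.

```python
import random

def sample_rr_set(in_arcs, nodes, rng):
    """Reverse-reachable set: start at a random root and walk arcs backwards,
    keeping each incoming arc with its influence probability."""
    root = rng.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u, p in in_arcs.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr

def to_rrc_set(rr, ctp, rng):
    """Drop every node u from the RR set with probability 1 - ctp[u]."""
    return {u for u in rr if rng.random() < ctp[u]}

def estimate_spread(sets, seeds, n):
    """n times the fraction of sampled sets that contain at least one seed."""
    covered = sum(1 for s in sets if s & seeds)
    return n * covered / len(sets)

# Chain a -> b -> c with certain arcs, stored as incoming-arc lists.
nodes = ["a", "b", "c"]
in_arcs = {"b": [("a", 1.0)], "c": [("b", 1.0)]}
rng = random.Random(0)
rr_sets = [sample_rr_set(in_arcs, nodes, rng) for _ in range(50)]
ctp = {u: 0.5 for u in nodes}  # made-up click-through probabilities
rrc_sets = [to_rrc_set(rr, ctp, rng) for rr in rr_sets]
```

Because node `a` reaches every node with certainty, it appears in every RR set and `estimate_spread(rr_sets, {"a"}, 3)` returns 3.0. Many RRC sets end up empty when CTPs are small, which is one way to see why the required sample size grows.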
33. TIRM
(2) Iterative Seed Set Size Estimation
For each advertiser i:
• Start with a “safe” initial seed set size si
• Sample θi(si) RR sets required for si
• Update si based on current regret
• Revise θi(si), sample additional RR sets, revise estimates
Estimation accuracy of TIRM: Theorem 4.6
34. Datasets and Parameters
Experiments
• Peer influence probabilities (one setting per dataset): TIC EM learning; exponential distribution; WC model; WC model
• CTPs: sampled uniformly at random from [0.01, 0.03]
https://github.com/aslayci/TIRM
35. Algorithms Tested
• MYOPIC: Top κu ads for which u has the highest δ(u,i) * CPE(i)
• MYOPIC+: Budget-aware MYOPIC enhancement
• Greedy-IRIE: Instantiation of the Greedy algorithm with IRIE* heuristic
• TIRM:
• ε set to 0.1 for quality experiments on FLIXSTER and EPINIONS
• ε set to 0.2 for scalability experiments on DBLP and LIVEJOURNAL
* K. Jung, W. Heo, and W. Chen, "IRIE: Scalable and Robust Influence Maximization in Social Networks", ICDM 2012
Experiments
38. Part III
Social Advertising:
Revenue Maximization
• C. Aslay, F. Bonchi, L. V. Lakshmanan, and W. Lu. “Revenue
Maximization in Incentivized Social Advertising”. Submitted to
International Conference on Very Large Data Bases (VLDB)
2017. (ArXiv e-prints, arXiv: 1612.00531)
39. Incentivized Social Advertising
CPE model with seed user incentives
• Advertiser
• Pays a fixed CPE to host for each
engagement
• Pays monetary incentive to each seed
user engaging with his ad
• Total payment subject to his budget
• Host
• Sells ad-engagements to advertisers
• Inserts promoted posts to feed of users in exchange for monetary incentives
• Seed users take a cut on the social advertising revenue
40. Revenue Maximization
• Given
• a social graph G = (V,E)
• TIC propagation model
• h advertisers with budget Bi and CPE(i) for each ad i
• seed user incentives ci(u) for each user u∈V and for each ad i
• Find an allocation S = (S1, …, Sh) that maximizes the
overall revenue of the host from the allocation:
41. Theoretical Analysis
• Revenue-Maximization problem is NP-hard
• Restricted special case with h = 1:
• NP-Hard Submodular-Cost Submodular-Knapsack* (SCSK) problem
• Constraints: partition matroid and submodular knapsack constraints
• Family 𝘊 of feasible solutions forms an Independence System
• Two greedy approximation algorithms w.r.t. sensitivity to seed user costs during the node selection
*Iyer et al., “Submodular optimization with submodular cover and submodular knapsack constraints”, NIPS 2013.
42. Theoretical Analysis
• Cost-agnostic greedy algorithm
• Selects (node,ad) pair giving the max. marginal increase in revenue
• Theorem 5.2: Approximation guarantee follows* from 𝘊 forming an
independence system
where
• R and r are, respectively, upper and lower rank of 𝘊
• κπ is the curvature of total revenue function π(.)
* Conforti et al., "Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some
generalizations of the Rado-Edmonds theorem.", Discrete Applied Mathematics 1984
43. Theoretical Analysis
• Cost-sensitive greedy algorithm
• Selects the (node,ad) pair giving the max. rate of marginal gain in
revenue per marginal gain in payment
• Theorem 5.3: Approximation guarantee obtained
where
• ρmax and ρmin are, respectively, max. and min. singleton payments
• κρi is the curvature of ad i’s payment function ρi(.)
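The difference between the two greedy selection rules can be shown with a tiny Python sketch. It assumes the marginal revenue gain and marginal payment of each candidate (node, ad) pair have already been computed and that payments are positive; budget feasibility checks are omitted. The field names and values are hypothetical.

```python
def pick_cost_agnostic(candidates):
    """Candidate with the largest marginal gain in revenue."""
    return max(candidates, key=lambda c: c["gain"])

def pick_cost_sensitive(candidates):
    """Candidate with the largest marginal revenue gain per unit of marginal payment."""
    return max(candidates, key=lambda c: c["gain"] / c["payment"])

candidates = [
    {"node": "u1", "ad": "x", "gain": 10.0, "payment": 8.0},
    {"node": "u2", "ad": "x", "gain": 6.0, "payment": 2.0},
]
agnostic_choice = pick_cost_agnostic(candidates)
sensitive_choice = pick_cost_sensitive(candidates)
```

The cost-agnostic rule prefers `u1` for its larger gain, while the cost-sensitive rule prefers `u2`, whose gain per unit of payment is higher.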
44. Scalable Algorithms
Two-Phase Iterative Revenue Maximization
• Built on the Reverse Influence Sampling framework of TIRM (Part II)
• Latent seed set size estimation
• Two-Phase Iterative Cost-Agnostic Revenue Maximization (TI-CARM)
• Two-Phase Iterative Cost-Sensitive Revenue Maximization (TI-CSRM)
45. Datasets and Parameters
Experiments
• Peer influence probabilities (one setting per dataset): TIC EM learning; TIC WC model; WC model; WC model
46. Algorithms Tested
Experiments
• TI-CARM
• TI-CSRM
• PageRank
• For each ad i, select the best candidate user w.r.t. PageRank ordering
• Among those, select the (user, ad) pair giving maximum marginal increase in the
revenue of the host
• ε set to 0.1 for quality experiments on FLIXSTER and EPINIONS
• ε set to 0.2 for scalability experiments on DBLP and LIVEJOURNAL
51. Contributions
Part I - Online Topic-aware Influence Maximization Queries
• Novel problem formulation
• Initiated the investigation of topic-aware influence indexing techniques in the influence maximization literature
• First step towards enabling online social influence analytics
• Orthogonal to efforts on scalable and efficient influence maximization algorithms
• Many direct follow ups1,2,3
1 S. Chen et al., "Online Topic-aware Influence Maximization", VLDB 2015
2 Li et al., "Real-time Targeted Influence Maximization for Online Advertisements", VLDB 2015
3 W. Chen et al., "Real-Time Topic-aware Influence Maximization using Preprocessing", ICCS 2015
C. Aslay, N. Barbieri, F. Bonchi, and R. Baeza-Yates. “Online Topic-aware Influence Maximization Queries”. Published in EDBT 2014.
52. Contributions
Part II - Social Advertising: Regret Minimization
• Initiated the investigation in the area of Social Advertising through the Viral Marketing lens to address problems that Influence Maximization and Computational Advertising literature fail to address in isolation
• Introduced novel discrete optimization problem with provable approximation guarantees
• Introduced TIC-CTP propagation model
• Extended the state-of-the-art influence maximization algorithms for scalable greedy approximation
• Latent seed set size estimation
• Handling TIC-CTP propagation model
C. Aslay, W. Lu, F. Bonchi, A. Goyal, and L. V. Lakshmanan. “Viral Marketing Meets Social Advertising: Ad Allocation with Minimum Regret”. Published in VLDB 2015.
53. Contributions
Part III - Social Advertising: Revenue Maximization
• Initiated the investigation in the area of Incentivized Social Advertising through the Viral Marketing lens
• Introduced novel discrete optimization problem
• Provided cost-agnostic and cost-sensitive approximation guarantees to submodular function maximization subject to a matroid and multiple submodular knapsack constraints
• Generalization of the restricted single submodular knapsack version of the problem (SCSK*)
• Theoretical results also valid for linear knapsack constraints
• κρi = 0 when payment function for ad i is modular
*Iyer et al., “Submodular optimization with submodular cover and submodular knapsack constraints”, NIPS 2013.
C. Aslay, F. Bonchi, L. V. Lakshmanan, and W. Lu. “Revenue
Maximization in Incentivized Social Advertising”. Submitted to
VLDB 2017. (arXiv: 1612.00531)