4. Generating event streams
[Figure: social networking system: user, front-end (application logic), data store clients, data stores, social graph]
Two types of user actions
Share an event
Generate a new event stream
5. Optimizations
Materialized views, one per user
Contain the user’s own events
May also contain events of the users that the view's owner follows
We abstract away application-specific “relevance” filters
All views contain all events stored in them
All queries to a view return all events in the view
Example VIEW (one EVENT per line: user, time, text)
Plato 12.00 “Having the shadow of an ideal sandwich”
Hume 12.01 “I just feel a good taste in my mouth”
Kant 12.02 “Dudes, just eat it and stop blabbering”
6. GOAL: Optimizing throughput
Throughput of event stream
Proportional to the amount of data being transferred
Partitioning social graphs is impossible (or at least very, very hard)
Existing approaches to optimize throughput
Push-all
Pull-all
Hybrid
7. Pull-all
Write to your own view only
Read from all your friends’ views
Simpler, good with frequent writes
[Figure: two panels, "WRITE from Alice" and "READ from Charlie"; each shows a client and the data stores holding the views of Alice, Bob, and Charlie]
8. Push-all
Write to all your friends’ views
Read from your view only
Good with frequent reads
[Figure: two panels, "WRITEs from Alice and Bob" and "READ from Charlie"; each shows clients and the data stores holding the views of Alice, Bob, and Charlie]
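The two strategies on the Pull-all and Push-all slides can be contrasted with a toy in-memory store. This is only a sketch: the `EventStore` class, its method names, and the choice to count a user's own view among the views touched are assumptions made for illustration, not part of the original system.

```python
from collections import defaultdict


class EventStore:
    """Toy store with one view (a list of events) per user. Hypothetical."""

    def __init__(self, follows):
        # follows[u] = set of users whose events u wants to see (assumed input)
        self.follows = follows
        self.views = defaultdict(list)

    def followers_of(self, user):
        return {u for u, followed in self.follows.items() if user in followed}

    # Pull-all: write to your own view, read from every followed view.
    def write_pull_all(self, user, time, text):
        self.views[user].append((time, user, text))
        return 1  # number of views touched

    def read_pull_all(self, user):
        sources = {user} | self.follows.get(user, set())
        events = sorted(e for s in sources for e in self.views[s])
        return events, len(sources)

    # Push-all: write to your own view and to every follower's view,
    # then read from your own view only.
    def write_push_all(self, user, time, text):
        targets = {user} | self.followers_of(user)
        for t in targets:
            self.views[t].append((time, user, text))
        return len(targets)

    def read_push_all(self, user):
        return sorted(self.views[user]), 1
```

With Charlie following Alice and Bob, a pull-all write touches one view and Charlie's read touches three, while a push-all write by Alice touches two views and Charlie's read touches one.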
9. Hybrid
[Silberstein et al., SIGMOD 2010]
Per-edge choice between pull or push
Uses Production Rate (PR) and Consumption Rate (CR)
Minimum per-edge throughput cost
For an edge A -> B:
If PR(A) < CR(B): PUSH
A writes onto B’s view
Cost: PR(A)
If PR(A) ≥ CR(B): PULL
B reads from A’s view
Cost: CR(B)
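The per-edge rule above amounts to paying min(PR(A), CR(B)) on each edge. A minimal sketch follows; the function name `hybrid_schedule` and the dictionary-based inputs are assumptions for illustration.

```python
def hybrid_schedule(edges, pr, cr):
    """Choose push or pull per edge (producer, consumer).

    pr[u] is the production rate of u, cr[v] the consumption rate of v.
    Returns the per-edge schedule and its total throughput cost.
    """
    schedule = {}
    cost = 0
    for a, b in edges:
        if pr[a] < cr[b]:
            schedule[(a, b)] = "push"  # A writes onto B's view
            cost += pr[a]
        else:
            schedule[(a, b)] = "pull"  # B reads from A's view
            cost += cr[b]
    return schedule, cost
```

For example, with PR(A) = 1, CR(B) = 3, PR(B) = 5, CR(C) = 2, the edge A -> B is served by a push and B -> C by a pull.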
10. Request schedule
[Figure: social networking system: user, front-end (application logic), data store clients, data stores, social graph]
Social graph contains the Request Schedule
Per-edge Push or Pull
Easy to integrate in existing systems
12. Idea: Social Piggybacking
Two friends are likely to share many common friends
Their views can be used as HUBS to prune edges
SOCIAL PIGGYBACKING (edges A -> B, B -> C, A -> C; B is the HUB)
PUSH: A writes new events onto B’s view
PULL: C reads events by B and A from B’s view
FREE EDGE A -> C: neither pull nor push
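A small worked comparison for the triangle A -> B, B -> C, A -> C may help. The sketch below assumes the per-edge hybrid cost min(PR(producer), CR(consumer)) as the alternative; the function name `triangle_costs` and the rates in the usage note are hypothetical.

```python
def triangle_costs(pr, cr):
    """Compare hybrid and piggybacking costs on edges A->B, B->C, A->C."""
    # Hybrid: every edge pays the cheaper of push and pull.
    hybrid = (
        min(pr["A"], cr["B"])
        + min(pr["B"], cr["C"])
        + min(pr["A"], cr["C"])
    )
    # Piggybacking: A pushes to hub B, C pulls from B, edge A->C is free.
    piggyback = pr["A"] + cr["C"]
    return hybrid, piggyback
```

With PR(A) = 1, PR(B) = 5, CR(B) = 3, CR(C) = 2, the hybrid cost is 4 and the piggybacking cost is 3. Piggybacking is not always cheaper: if PR(A) were much larger than CR(B), forcing the push A -> B could cost more than the hybrid choice.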
13. Social Dissemination Problem
Inputs
Social Graph
Per-node production and consumption rates
Output: request schedule that minimizes cost
Each edge needs to be covered
Can be through a hub, push or pull
Requirements
Bounded staleness
Non-triviality
14. Analysis
In every admissible request schedule, each edge satisfies one of the following:
The edge is served directly, using a push or a pull, or
The edge is served through a hub.
Any other schedule is not admissible
The Social Dissemination problem is NP-hard
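The coverage condition above can be checked mechanically. The sketch below assumes a schedule is given as two sets of edges, `push` and `pull`, and that a hub for edge (u, v) is any node w with a push u -> w and a pull w -> v. The function names are hypothetical, and the staleness and non-triviality requirements are not modeled.

```python
def is_covered(edge, push, pull):
    """True if the edge is served directly or through some hub."""
    u, v = edge
    if edge in push or edge in pull:
        return True
    # Candidate hubs: nodes that u already pushes to.
    hubs = {w for (a, w) in push if a == u}
    # The edge is free if v pulls from one of those hubs.
    return any((w, v) in pull for w in hubs)


def is_admissible(edges, push, pull):
    """True if every edge of the social graph is covered."""
    return all(is_covered(e, push, pull) for e in edges)
```

On the triangle A -> B, B -> C, A -> C, the schedule with push {A -> B} and pull {B -> C} is admissible, while dropping the push leaves both A -> B and A -> C uncovered.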
15. Nosy: A Simple Heuristic
Nosy looks for hubgraphs
Cost with piggybacking: PR(X) + CR(Y); cross edges are free
16. Nosy Phase 1
Add elements to X sets
For each edge (w, y)
Build the largest hubgraph(X, w, y)
Piggybacking cost: PR(X) + CR(y)
Cross edges X -> y are free
Piggyback if cheaper than hybrid
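One candidate evaluation in Phase 1 might look as follows. The slides do not spell out the details, so this sketch rests on assumptions: X is taken to be every node with an edge to both w and y, PR(X) is the sum of the production rates in X, and the hybrid alternative pays min(PR, CR) on each of the same edges. The function name `evaluate_phase1_candidate` is hypothetical.

```python
def evaluate_phase1_candidate(edges, pr, cr, w, y):
    """Compare piggybacking through hub w with hybrid for edge (w, y)."""
    edge_set = set(edges)
    producers = {a for a, _ in edge_set}
    # Assumed: the largest X is the set of common producers of w and y.
    x_set = {
        x for x in producers
        if x not in (w, y) and (x, w) in edge_set and (x, y) in edge_set
    }
    # Piggybacking: each x pushes to w, y pulls from w, X -> y is free.
    piggyback = sum(pr[x] for x in x_set) + cr[y]
    # Hybrid: edges x -> w, w -> y, and x -> y each pay min(PR, CR).
    hybrid = (
        sum(min(pr[x], cr[w]) for x in x_set)
        + min(pr[w], cr[y])
        + sum(min(pr[x], cr[y]) for x in x_set)
    )
    return x_set, piggyback, hybrid, piggyback < hybrid
```

With producers a and b feeding both w and y, PR(a) = 1, PR(b) = 2, PR(w) = 5, CR(w) = 10, CR(y) = 4, piggybacking costs 7 against 10 for hybrid.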
17. Nosy Phase 2
Add elements to Y sets
For each (w, y)
Let Xy be the producers of y that already push to w
Piggybacking cost: CR(y)
Cross edges Xy -> y are free
Piggyback if cheaper than hybrid
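The Phase 2 test can be sketched in the same spirit. The assumptions here are that the set of existing pushes is available as `push`, that the pushes into w are already paid for, and that the hybrid alternative pays min(PR, CR) on w -> y and on each edge Xy -> y. The function name `evaluate_phase2_candidate` is hypothetical.

```python
def evaluate_phase2_candidate(edges, push, pr, cr, w, y):
    """Decide whether y should pull from hub w, given existing pushes."""
    edge_set = set(edges)
    # Producers of y that already push to w.
    x_y = {x for (x, hub) in push if hub == w and (x, y) in edge_set}
    # Piggybacking: one extra pull w -> y; edges Xy -> y become free.
    piggyback = cr[y]
    # Hybrid: w -> y and every x -> y pay min(PR, CR) individually.
    hybrid = min(pr[w], cr[y]) + sum(min(pr[x], cr[y]) for x in x_y)
    return x_y, piggyback, hybrid, piggyback < hybrid
```

With a and b already pushing to w, PR(a) = 1, PR(b) = 2, PR(w) = 5, and CR(y) = 4, the pull costs 4 against 7 for hybrid.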
19. Experiments
Twitter (Aug 2009) and Flickr (Apr 2008) social graphs
Samples obtained using random walks, which preserve graph properties
Average sizes
Flickr: 4 k nodes, 112 k edges
Twitter: 25 k nodes, 158 k edges
Production and consumption rates are generated
write:read ratio is 1:5
PR (resp. CR) increases logarithmically with out-degree (resp. in-degree)
20. Metrics and Results
Metric
Improvement over hybrid optimization (baseline)
Gain(A) = Cost(BASE) / Cost(A) – 1
Results
1. Nosy exploits the community structure
2. It works well under a variety of parameters
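The gain metric above is simple arithmetic. A minimal sketch, with a hypothetical function name and hypothetical costs:

```python
def gain(cost_base, cost_a):
    """Gain(A) = Cost(BASE) / Cost(A) - 1."""
    if cost_a <= 0:
        raise ValueError("cost_a must be positive")
    return cost_base / cost_a - 1
```

A gain of 0 means A costs as much as the baseline. A baseline cost of 10 against a cost of 5 for A gives a gain of 1, and a baseline cost that is 2.4 times Cost(A) gives a gain of about 1.4.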
21. Clustering Coefficient
After sampling, we keep only a fraction s of edges
B+ is a trivial extension of Baseline
Lock push edges
Pull edges that can be served using hubs are free
More clustering, more gain for Nosy but not for B+
22. Varied Workload
Significant gains
Asymptotically, i.e. with all reads, the per-edge push-based solution is optimal, so the gain tends to zero
23. Effect of Colocation
As the system size grows, the gains reach their maximum
For very small systems there is little communication, so little room for improvement
24. Conclusions
Social Piggybacking is a very promising approach
Baseline has up to 2.4 times higher throughput cost
Easy to integrate in existing systems
Next steps
Run on full social graphs
Evaluate throughput gain on an actual social networking system