Joint work with:
Ryan Rossi, Nick Duffield, Ted Willke
• Streaming Graph Algorithms
• Unbiased estimation, sketching, summarization, bipartite graph projection
• Relational Learning
• Sampling, inference, representation learning, graph embeddings, role discovery,
higher-order models, graph kernels, graph convolutions
• Parallel Graph Algorithms
• Subgraph counting, graphlet decomposition, approximations, coloring, k-core/k-truss
• Applications
• Real-time graph mining, interactive visual graph analytics, higher-order network
analysis, brain connectivity
Nesreen Ahmed – Intel Labs
Large-Scale Relational Learning & Graph Mining
Theory, Algorithms, and Applications
(1) Challenges for streaming graph analysis
(2) A framework for sampling/summarization of massive
streaming graphs
Graphs are naturally dynamic & streaming over time
Streaming graphs are continuously growing -> massive in size
[Figure: a sequence of graph snapshots arriving over times t-p, ..., t-1, t]
Discrete-time models: represent a dynamic network as a sequence of static snapshot graphs, each aggregating a user-defined time interval (e.g., t=1 covers [1, 5], t=2 covers [6, 10])
• Very coarse representation with noise/error problems
• Difficult to manage at a large scale
Streaming Model
Due to these challenges, we usually need to sample
Sampling/
Summarization
Studying and analyzing streaming complex graphs
is a challenging and computationally intensive task
• It is not always possible to store the data in full
• It is faster/more convenient to work with a compact summary
Sample/summary
Graph (S)
Graph Stream
Data/ML
Algorithms
Sample/summary
Graph (S)
Model Learning
Study Network
Structure
Network Parameter
Estimation
Feature Representation
Turn large data streams into smaller, manageable summaries
- Speeds up queries and analysis; saves storage
Approximate
Query Processing
§ Other Alternatives
• Sketching
• dimensionality reduction
• dictionary-based summarization
All worthy of study – but not the focus of this talk
§ Sampling has an intuitive semantics
§ Statistical estimation on a sample is often straightforward
• Run the analysis on the sample that you would on the full data
• Some rescaling/reweighting may be necessary
§ Sampling is general/agnostic to the analysis to be done
• can be tuned to optimize some criteria
§ Sampling is (usually) easy to understand
Data Characteristics
Heavy Tailed distribution,
Correlations, clusters, rare events
Query Requirements
Accuracy, Aggregates,
Speed, privacy
Resource Constraints
Bandwidth, Storage,
access constraints
Sampling
Study Goals
Parameter Estimation
Data Collection
Learning a Model
Sample/summary
Given a large graph G represented as a stream of edges e1, e2, e3, …
We show how to efficiently sample from G with limited memory
budget to calculate unbiased estimates of various graph properties
Frequent connected subsets of edges
Transitivity = No. Triangles / No. Wedges
§ Take a single linear scan of the edge stream to draw a sample
• Streaming model of computation: see each element once
• Flip a coin with probability p for each edge
• If heads, the edge is added to the sample
• Variable sample size
• Problems with handling heavy-tailed distributions
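The coin-flip scheme above can be sketched in a few lines (a minimal illustration, not code from the paper; names are ours):

```python
import random

def bernoulli_edge_sample(edge_stream, p, seed=None):
    """One-pass Bernoulli sampling: keep each edge independently w.p. p.

    Note the sample size is random (Binomial(N, p)) -- one of the
    drawbacks mentioned above.
    """
    rng = random.Random(seed)
    sample = []
    for edge in edge_stream:
        if rng.random() < p:  # "flip a coin" for each arriving edge
            sample.append(edge)
    return sample

edges = [(i, i + 1) for i in range(1000)]
s = bernoulli_edge_sample(edges, p=0.1, seed=42)
```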
“Reservoir sampling” described by [Knuth 69, 81]; enhancements
[Vitter 85]
§ Fixed size k uniform sample from arbitrary size N stream in one
pass
§ No need to know stream size in advance
§ Include first k items w.p. 1
§ Include item n > k with probability p(n) = k/n
— Pick j uniformly from {1,2,…,n}
— If j ≤ k, swap item n into location j in reservoir, discard
replaced item
Easy to prove the uniformity of the sampling method
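The steps above are a minimal version of Vitter's Algorithm R (an illustrative sketch; function and variable names are ours):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Fixed-size-k uniform sample from a stream of unknown length, one pass."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # first k items kept w.p. 1
        else:
            j = rng.randint(1, n)         # pick j uniformly from {1,...,n}
            if j <= k:
                reservoir[j - 1] = item   # item n replaces slot j w.p. k/n
    return reservoir

sample = reservoir_sample(range(10_000), k=100, seed=7)
```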
§ Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles – [Jha et al. KDD’13]
— Sample edges using reservoir sampling, then sample pairs of incident
edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling – [Pavan et al. VLDB’13]
— Sampling vectors of wedge estimators, scan the stream for closed wedges
(triangles)
• TRIEST– [De Stefani et al. KDD’16]
— Uses standard reservoir sampling to maintain the edge sample
• MASCOT– [Lim et al. KDD’15]
— Bernoulli edge sampling with probability p
• Graph Sample & Hold– [Ahmed et al. KDD’14]
— Weighted conditionally independent edge sampling
Focus of previous work
Sampling designs for specific graph properties (triangles)
Not generally applicable to other properties
Uniform-based Sampling
Obtain variable-size sample
Our focus - Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable for general graph properties
• Can incorporate auxiliary variables
(1) Challenges for streaming graph analysis
(2) A framework for sampling/summarization of massive
streaming graphs
✓
[Ahmed et al., VLDB 2017], [Ahmed et al., IJCAI 2018]
§ Order sampling a.k.a. bottom-k sample, min-hashing
§ Uniform sampling of stream into reservoir of size k
§ Each arrival n: generate one-time random value rn ∈ U[0,1]
• rn also known as hash, rank, tag…
§ Store k items with the smallest random tags
§ Each item has same chance of least tag, so still uniform
Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273
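A minimal bottom-k sketch (illustrative only; we use a heap keyed on the tags so the largest retained tag can be evicted in O(log k) per arrival):

```python
import heapq
import random

def bottom_k_sample(stream, k, seed=None):
    """Order sampling: keep the k items with the smallest random tags."""
    rng = random.Random(seed)
    heap = []  # entries: (-tag, item); min-heap on -tag = max-heap on tag
    for item in stream:
        tag = rng.random()  # one-time random tag r_n ~ U[0,1]
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:  # smaller than the largest retained tag
            heapq.heapreplace(heap, (-tag, item))
    return [item for _, item in heap]

sample = bottom_k_sample(range(1000), k=10, seed=1)
```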
Graph Priority Sampling Framework GPS(m)
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream K̂; stored state m = O(|K̂|)
For each edge k:
• Generate a random number u(k) ∼ Uni(0, 1]
• Compute edge weight w(k) = W(k, K̂)
• Compute edge priority r(k) = w(k)/u(k)
• K̂ = K̂ ∪ {k}
When the sample exceeds the budget m:
• Find the edge with lowest priority: k* = arg min_{k′ ∈ K̂} r(k′)
• Update the sample threshold: z* = max{z*, r(k*)}
• Remove the lowest-priority edge: K̂ = K̂ \ {k*}
Use a priority queue with O(log m) updates
§ We use edge weights to express the role of the arriving
edge in the sample/summary graph
• e.g., subgraphs completed by the arriving edge
• other auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a Min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority
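The GPS loop might be sketched as follows. This is an illustrative simplification, not the paper's implementation: the `weight_fn` parameter and all names are our assumptions, and it assumes distinct edges in the stream.

```python
import heapq
import random

def graph_priority_sample(edge_stream, m, weight_fn, seed=None):
    """Sketch of Graph Priority Sampling with a budget of m edges.

    weight_fn(edge, sample) plays the role of W(k, K_hat);
    priorities are r(k) = w(k)/u(k); the min-priority edge is
    evicted once the reservoir exceeds m. Returns the sample
    and the final threshold z*.
    """
    rng = random.Random(seed)
    heap = []        # min-heap of (priority, edge)
    sample = set()   # K_hat
    z_star = 0.0
    for k in edge_stream:
        u = 1.0 - rng.random()        # u(k) ~ Uni(0, 1]
        w = weight_fn(k, sample)      # w(k) = W(k, K_hat)
        r = w / u                     # priority r(k)
        heapq.heappush(heap, (r, k))
        sample.add(k)
        if len(sample) > m:           # evict the lowest-priority edge
            r_min, k_min = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            sample.discard(k_min)
    return sample, z_star

edges = [(i, (i * 7) % 50) for i in range(500)]
S, z = graph_priority_sample(edges, m=100, weight_fn=lambda e, s: 1.0, seed=3)
```

With uniform weights, as noted later in the talk, this degenerates to a uniform sample.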
For each edge i, we construct a sequence of edge estimators Ŝ_{i,t}
We achieve unbiasedness by establishing that the sequence is a Martingale (Theorem 1): E[Ŝ_{i,t}] = S_{i,t}
Ŝ_{i,t} = I(i ∈ K̂_t) / min{1, w_i/z*}
where the Ŝ_{i,t} are unbiased estimators of the corresponding edge indicators, and K̂_t is the sample at time t
Edge Estimation
[Ahmed et al, VLDB 2017]
For each subgraph J ⊂ [t], we define the sequence of subgraph estimators as
Ŝ_{J,t} = ∏_{i∈J} Ŝ_{i,t}
We prove the sequence is a Martingale (Theorem 2): E[Ŝ_{J,t}] = S_{J,t}
Subgraph Estimation
[Ahmed et al, VLDB 2017]
Subgraph Counting
For any set 𝒥 of subgraphs of G,
N̂_t(𝒥) = Σ_{J ∈ 𝒥 : J ⊂ K̂_t} Ŝ_{J,t}
is an unbiased estimator of N_t(𝒥) = |𝒥_t| (Theorem 2)
[Ahmed et al, VLDB 2017]
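The count estimator above can be illustrated with a toy Horvitz-Thompson-style computation. The `p` dictionary of effective inclusion probabilities p_i = min{1, w_i/z*} is hypothetical, standing in for values the sampler would maintain:

```python
def subgraph_count_estimate(subgraphs, p):
    """N_hat = sum over sampled subgraphs J of prod_{i in J} 1/p_i,
    where p[i] is edge i's effective inclusion probability."""
    total = 0.0
    for J in subgraphs:          # each J is an iterable of sampled edge ids
        est = 1.0
        for i in J:
            est *= 1.0 / p[i]    # inverse-probability weight per edge
        total += est
    return total

# Toy example: two sampled triangles whose edges were each kept w.p. 0.5,
# so each triangle contributes (1/0.5)^3 = 8 to the estimate.
p = {e: 0.5 for e in range(6)}
n_hat = subgraph_count_estimate([(0, 1, 2), (3, 4, 5)], p)
```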
§ We provide a cost minimization approach
• inspired by IPPS sampling of i.i.d. data [Cohen et al. 2005]
• minimizes the conditional variance of the increment that the arriving edge contributes to N_t(𝒥)
How should the ranks r_{i,t} be distributed in order to minimize the variance of the unbiased estimator of N_t(𝒥)?
§ Post-stream Estimation
• Constructs a reference sample for retrospective queries
• after any number t of edge arrivals have taken place, we can
compute an unbiased estimate for the count of arbitrary subgraphs
§ In-stream Estimation
• we can take “snapshots” of estimates of specific sampled subgraphs
at arbitrary times during the stream
• Still Unbiased!
• Lightweight online/incremental update of unbiased estimates of
subgraph counts
• Same sampling procedure
In-stream Estimation: within the same GPS(m) framework, for each arriving edge k, compute the edge priority r(k) = w(k)/u(k), update the sample, and update unbiased estimates of subgraph counts online
We define a snapshot as an edge subset J, with a family of stopping times T such that T = {T_j : j ∈ J}
We prove the sequence is a stopped Martingale (Theorem 4):
Ŝ^T_{J,t} = ∏_{j∈J} Ŝ^{T_j}_{j,t} = ∏_{j∈J} Ŝ_{j, min{T_j, t}}
E[Ŝ^T_{J,t}] = S_{J,t}
[Ahmed et al, VLDB 2017]
[Figure: a sample S containing overlapping and non-overlapping triangles; overlapping triangles require multiple stopping times for the common edge]
• Different weighting schemes lead to different algorithms
- Uniform weights -> uniform sampling
• Ability to incorporate auxiliary variables
- Using the weight function
• Post-stream estimation
- Construction of reference samples for retrospective queries
• In-stream estimation
- Maintains the desired query answer (and its variance) and updates it accordingly -> in a sketching fashion
§ We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
§ Weight function: W(k, K̂) = 9 · △̂(k) + 1
where △̂(k) is the number of triangles completed by edge k whose other edges are in K̂
§ Used a large set of graphs from a variety of domains (social, web, tech, etc.) - data is available on http://networkrepository.com/
— Up to 49B edges
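The triangle-based weight function might look like this (an illustrative sketch; `adj`, an adjacency structure over the sampled edges, and the function name are our assumptions):

```python
from collections import defaultdict

def triangle_weight(edge, adj):
    """W(k, K_hat) = 9 * tri(k) + 1, where tri(k) is the number of
    triangles edge k = (u, v) completes with edges already sampled.
    adj maps each node to the set of its neighbors in the sample."""
    u, v = edge
    completed = len(adj[u] & adj[v])  # common neighbors = completed triangles
    return 9 * completed + 1

# Toy sampled graph: triangle 1-2-3 plus edge 1-4.
adj = defaultdict(set)
for a, b in [(1, 2), (2, 3), (1, 3), (1, 4)]:
    adj[a].add(b)
    adj[b].add(a)

w = triangle_weight((3, 4), adj)  # node 1 is the one common neighbor
```

Edges that close many triangles thus get priorities up to an order of magnitude larger than edges that close none.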
- GPS accurately estimates various properties simultaneously
- Consistent performance across graphs from various domains
- A key advantage: GPS in-stream estimation has smaller variance and tighter confidence bounds
Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with <0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals
[Figure: estimated/actual ratio vs. sample size |K| on soc-twitter-2010, with panels for Triangle Count, Wedge Count, and Global Clustering Coeff, showing the actual value, the estimated/actual ratio, and the confidence upper & lower bounds]
Sample Size = 40K edges
Accurate estimates for large Twitter graph ~ 265M edges, and 17.2B triangles
95% confidence intervals
Accurate estimates for the large social network Orkut: ~120M edges and 630M triangles
Sample Size = 40K edges
95% confidence intervals
[Figure: estimated/actual ratio vs. sample size |K| on soc-orkut, with panels for Triangle Count, Wedge Count, and Global Clustering Coeff, showing the actual value, the estimated/actual ratio, and the confidence upper & lower bounds]
[Figure: in-stream estimates over time on soc-orkut - triangle count and clustering coefficient at time t vs. stream size |K_t|, showing actual, estimate, and upper/lower bounds]
GPS in-stream estimates over time
Sample size = 80K edges
95% confidence intervals
[Figure: estimated/actual triangle counts vs. estimated/actual wedge counts, all within 1 ± 0.006, for ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google]
GPS In-stream Estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts
simultaneously with a single sample
We observe accurate results with no significant difference in error between
the ordering schemes
§ We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for Friendster social network
with sample size=1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
- this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and
6.2x higher using uniform weights compared to triangle-based weights.
(1) Challenges for streaming graph analysis
(2) A framework for sampling/summarization of massive
streaming graphs
✓
✓
[Ahmed et al., VLDB 2017], [Ahmed et al., IJCAI 2018]
§ Queries beyond triangles
• Higher-order subgraphs
§ Streaming bipartite network projection
§ Approximate one-mode bipartite graph projection
§ To estimate similarity among one set of the nodes
§ Adaptive sampling
• Adaptive weights vs fixed weight
• Insertion/deletion streams – other dynamics
§ Batch computations, libraries …
To appear [Ahmed et al., IJCAI 2018]
§ A sample is representative if graph properties of interest can be
estimated with a known degree of accuracy
§ Graph Priority Sampling (GPS) – a unifying framework
- GPS is an efficient single-pass streaming framework
§ GPS is general and agnostic to the desired query
• Allows the sampling process to depend on the stored state and/or auxiliary variables
§ GPS is a variance-minimizing sampling approach
§ GPS achieves a relative estimation error < 1%
§ On sampling from massive graph streams. VLDB 2017, [Ahmed et al.]
§ Sampling for Bipartite Network Projection. IJCAI 2018, [Ahmed et al.]
§ A space efficient streaming algorithm for triangle counting using the birthday paradox. KDD 2013, [Jha et al.]
§ Counting and Sampling Triangles from a Graph Stream. VLDB 2013, [Pavan et al.]
§ Efficient Graphlet Counting for Large Networks. ICDM 2015, [Ahmed et al.]
§ Graphlet Decomposition: Framework, Algorithms, and Applications. J. Know. & Info. 2016 [Ahmed et al.]
§ Estimation of Graphlet Counts in Massive Networks. IEEE TNNLS 2018 [Rossi-Zhou-Ahmed]
§ MASCOT: Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams. KDD 2015, [Lim et al.]
§ Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, [Milo et al.]
§ Graph Sample and Hold: A Framework for Big Graph Analytics. KDD 2014 [Ahmed-Duffield-Neville-Kompella]
§ Role Discovery in Networks. IEEE TKDE 2015 [Rossi-Ahmed]
§ Efficient semi-streaming algorithms for local triangle counting in massive graphs. KDD 2008 [Becchetti et al.]
§ Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 1985. [Vitter]
Thank you!
Questions?
nesreen.k.ahmed@intel.com
http://nesreenahmed.com

Sampling from Massive Graph Streams: A Unifying Framework