Network sampling and applications to big data
and machine learning
Antoine Rebecq
Université Paris Nanterre, 200 av. de la République, 92000 Nanterre, FRANCE
Shopify Montréal, 490 rue de la GauchetiÚre O, Montréal H2Z 0B3, CANADA
1. Introduction
Applications of graph (network) data are increasingly popular in the tech industry. Graph data processing is used in many applications, including statistical analyses and machine learning, but it is often very costly and poses challenges at scale. Here we investigate different graph sampling algorithms, describe their efficiency, and showcase their application on a basic recommender-system problem. We propose weighted vertex-induced snowball sampling (WVISS), which we find to be more efficient than many competing graph sampling algorithms.
2. Graph topology is very diverse
Real-world graphs often possess difficult-to-model properties ([1]), including the scale-free property (the degree distribution has a long tail) and the small-world property (path lengths are short). Since the 1990s, probabilistic models have been designed to capture these properties, such as Barabási and Albert's (scale-free) or Watts and Strogatz's (small-world). Real-world graphs are often a combination of both and thus challenging to model. The following graph shows the distributions of the log-degree and of the path lengths for the Twitter graph ([2]); they correspond to the scale-free and small-world properties, respectively.
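As an illustration (not part of the poster), the following minimal sketch uses the networkx library to generate the two classical models mentioned above and to check the properties they are designed to capture; the parameter values are arbitrary.

```python
# Minimal sketch (assumes networkx and numpy are installed); parameters are arbitrary.
import networkx as nx
import numpy as np

# Barabasi-Albert: preferential attachment produces a long-tailed (scale-free)
# degree distribution.
ba = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)
degrees = np.array([d for _, d in ba.degree()])
print("BA median degree:", np.median(degrees), "max degree:", degrees.max())

# Watts-Strogatz: rewiring a ring lattice keeps clustering high while making
# average path lengths short (small-world).
ws = nx.connected_watts_strogatz_graph(n=2_000, k=10, p=0.1, seed=42)
print("WS average path length:", nx.average_shortest_path_length(ws))
print("WS average clustering:", nx.average_clustering(ws))
```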
4. General WVISS eïŹƒciency
We measured the precision of WVISS-based estimates in simulations on graphs generated from probabilistic models and on real-world networks. The following graph shows the design effect of WVISS when the first-phase probabilities are based on local clustering, i.e., the precision of the estimates obtained compared to a design with uniform probabilities and the same sample size. A design effect greater than 1 indicates that it is more cost-effective to run simple sampling than WVISS, and vice versa. This graph shows that efficiency is difficult to predict and depends very much on the graph and on the sampling strategy chosen.
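For reference, a design effect of this kind can be estimated by Monte Carlo: repeat both designs many times on the same graph and compare the variances of the resulting estimates. The sketch below is illustrative only; the estimate arrays are hypothetical placeholders, not outputs of the poster's simulations.

```python
import numpy as np

def design_effect(estimates_design, estimates_uniform):
    """Monte Carlo design effect: variance of the estimator under the studied
    design (e.g. WVISS) divided by its variance under a uniform-probability
    sample of the same size. Values above 1 favour simple sampling,
    values below 1 favour the studied design."""
    return np.var(estimates_design, ddof=1) / np.var(estimates_uniform, ddof=1)

# Hypothetical usage: one estimate per simulated sample for each design.
# deff = design_effect(estimates_wviss, estimates_uniform)
```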
3. Weighted vertex-induced snowball sampling
Probabilistic sampling involves selecting n units from a population U of size N at random, with (inclusion) probability π_k for each k ∈ U. Given a variable of interest taking values y_k, an unbiased estimate of its mean can be computed using the Horvitz–Thompson formula (1/N) ∑_{k∈S} y_k / π_k, where S is the sample. The coefficients w_k = 1/π_k are called sampling weights. Maximum precision is obtained when π_k ∝ y_k. In practice, y_k is unobserved, so inclusion probabilities are computed using some auxiliary information correlated with y_k.
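As a concrete illustration of the estimator above, here is a minimal sketch (Python with numpy; not part of the poster) of the Horvitz–Thompson mean over a sample drawn with unequal inclusion probabilities.

```python
import numpy as np

def horvitz_thompson_mean(y_sample, pi_sample, population_size):
    """Horvitz-Thompson estimate of the population mean:
    (1/N) * sum over sampled units of y_k / pi_k."""
    weights = 1.0 / np.asarray(pi_sample, dtype=float)  # sampling weights w_k = 1/pi_k
    return float(np.sum(np.asarray(y_sample, dtype=float) * weights)) / population_size

# Toy example: three sampled units from a population of size N = 20.
print(horvitz_thompson_mean([4.0, 10.0, 7.0], [0.2, 0.5, 0.35], population_size=20))
```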
One of the simplest methods for graph sampling is uniform vertex-induced subgraph sampling, which consists in selecting vertices at random with uniform probabilities (π_k = c for some constant c), along with any edge that connects two vertices of the sample. Snowball sampling (unweighted), described in [3], consists in selecting vertices with uniform probability and then adding all their neighbors to the sample, along with the induced edges. It generally uses unweighted estimates.
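These two baseline methods can be sketched as follows; this is a minimal illustration using networkx on an undirected graph, not the poster's implementation.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def uniform_induced_sample(g, p):
    """Uniform vertex-induced subgraph sampling: keep each vertex independently
    with probability p (pi_k = p), then keep every edge whose two endpoints
    were both kept."""
    kept = [v for v in g.nodes if rng.random() < p]
    return g.subgraph(kept).copy()

def snowball_sample(g, p):
    """Unweighted snowball sampling: draw a uniform seed sample, add all
    neighbours of the seeds, and return the induced subgraph."""
    seeds = {v for v in g.nodes if rng.random() < p}
    expanded = set(seeds)
    for v in seeds:
        expanded.update(g.neighbors(v))
    return g.subgraph(expanded).copy()
```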
Weighted vertex-induced snowball sampling (WVISS) works in three phases: first, a sample of n_0 vertices is drawn at random with a specific strategy based on external information and/or graph topology. Second, all vertices connected to the vertices selected in the first phase are added to the sample. Finally, all edges connecting sampled vertices in the initial graph are added, which finalizes the sample graph. The final sample size n of the sampled graph is thus random. The graph on the left illustrates the first and the final steps of the procedure on an example graph. All WVISS estimates use sampling weights. We show that the corresponding Horvitz–Thompson weights can be computed in closed form for each vertex k of the graph, thus providing unbiased estimates of any mean or total of a linear variable:

w_k = 1 / (1 − ∏_{j∈B_k} (1 − π_j)),

where B_k is the set of vertices having an edge pointing to vertex k.
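A minimal sketch of the three phases and of the closed-form weights is given below (Python/networkx on an undirected graph; not the poster's code). It assumes that B_k also contains k itself, so that a seed vertex is covered by its own first-phase probability; `pi` is any dictionary of first-phase inclusion probabilities.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def wviss_sample(g, pi):
    """Weighted vertex-induced snowball sampling (sketch).
    `pi` maps each vertex to its first-phase inclusion probability."""
    # Phase 1: draw the seed vertices with unequal probabilities pi_k.
    seeds = {v for v in g.nodes if rng.random() < pi[v]}
    # Phase 2: add every vertex connected to a seed.
    sampled = set(seeds)
    for v in seeds:
        sampled.update(g.neighbors(v))
    # Phase 3: keep all edges of the original graph between sampled vertices.
    subgraph = g.subgraph(sampled).copy()
    # Closed-form Horvitz-Thompson weights: vertex k is sampled as soon as at
    # least one vertex of B_k is a seed, so w_k = 1 / (1 - prod_{j in B_k}(1 - pi_j)).
    # Assumption: B_k = neighbours of k plus k itself (undirected graph).
    weights = {}
    for k in sampled:
        b_k = set(g.neighbors(k)) | {k}
        weights[k] = 1.0 / (1.0 - np.prod([1.0 - pi[j] for j in b_k]))
    return subgraph, weights
```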
5. Application to machine learning (recommendations)
[Figure: top-10 accuracy vs. sampling fraction (0 to 0.25) for uniform induced sampling, unweighted snowball sampling, and WVISS.]
We simulated a graph recommendation problem using a co-purchase graph of items generated from a forest-fire model (of order N = 8000 and fwprobs = 0.15). The goal is to predict the next purchases of users. Each user who has just purchased object i has a hidden preference for object j determined by the equation:

preference_j = ÎČ_1 · degree_j + ÎČ_2 · distance_{i,j}    (1)

The figure on the left shows the accuracy of the next-10-purchases prediction (top-10 accuracy) for each sampling algorithm, for several sample sizes (expressed as fractions of N), with ÎČ_1 = 0.2 and ÎČ_2 = 0.1. WVISS is consistently more accurate than the other sampling algorithms. Reaching an accuracy of 0.5 requires nearly half as many units with WVISS as with unweighted snowball sampling.
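To make equation (1) concrete, the sketch below scores candidate items for a user who just bought item i and returns the ten highest-preference items. It is illustrative only: it takes any item graph `g` as input (the forest-fire co-purchase graph itself is not reconstructed here), and the function name is ours.

```python
import networkx as nx

def top10_items(g, i, beta1=0.2, beta2=0.1):
    """Score every item j reachable from item i with equation (1):
    preference_j = beta1 * degree_j + beta2 * distance_{i,j},
    and return the 10 items with the highest preference."""
    distances = nx.single_source_shortest_path_length(g, i)
    scores = {j: beta1 * g.degree(j) + beta2 * d
              for j, d in distances.items() if j != i}
    return sorted(scores, key=scores.get, reverse=True)[:10]
```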
6. References
[1] Eric D. Kolaczyk. Statistical Analysis of Network Data. Springer, 2009.
[2] Seth A. Myers and Jure Leskovec. The bursty dynamics of the Twitter information network. In Proceedings of the 23rd International Conference on World Wide Web, pages 913–924. ACM, 2014.
[3] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631–636. ACM, 2006.
