Network sampling and applications to big data and machine learning

Poster presented at the 2019 Montreal AI Symposium

Antoine Rebecq
Université Paris Nanterre, 200 av. de la République, 92000 Nanterre, FRANCE
Shopify Montréal, 490 rue de la Gauchetière O, Montréal H2Z0B3, CANADA
1. Introduction
Graph (network) data are increasingly popular in the tech industry and feed many applications, including statistical analyses and machine
learning. Processing graph data is often very costly, however, and poses challenges when scaling. Here we investigate different graph sampling
algorithms, describe their efficiency, and showcase their application on a basic recommender system problem. We propose weighted vertex-induced
snowball sampling (WVISS), which we find to be more efficient than many competing graph sampling algorithms.
2. Graph topology is very diverse
Real-world graphs often possess properties that are difficult to model ([1]),
including the scale-free property (the degree distribution has a long tail)
and the small-world property (path lengths are short). Since the 1990s,
probabilistic models have been created to capture these properties, such as
Barabási and Albert's (scale-free) or Watts and Strogatz's (small-world).
Real-world graphs often combine both properties and are thus challenging to
model. The following graph shows the distributions of the log-degree and of
the path lengths for the Twitter graph ([2]). They exhibit the scale-free and
small-world properties, respectively:
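As a rough illustration of the two properties, the sketch below computes a degree distribution and a mean shortest-path length for a graph stored as an adjacency dictionary. The function names and the toy graph are hypothetical and chosen only for this example; on a large graph such as Twitter's, the all-pairs breadth-first search would itself have to be sampled.

```python
from collections import Counter, deque


def degree_distribution(adj):
    """Map each degree value to the number of vertices having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def mean_path_length(adj):
    """Average shortest-path length over ordered pairs of connected vertices."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy undirected graph (made up): a hub linked to five vertices, plus edge 1-2.
toy = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
print(dict(degree_distribution(toy)))
print(mean_path_length(toy))
```

A long-tailed degree count (a few hubs, many low-degree vertices) hints at the scale-free property, while a small mean path length relative to the number of vertices hints at the small-world property.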
4. General WVISS efficiency
We measured the precision of WVISS-based estimates in simulations on graphs
generated from probabilistic models and on real-world networks. The following
graph shows the design effect of WVISS based on local clustering, that is, the
variance of the estimates obtained relative to that of a design with uniform
probabilities and the same size. A design effect greater than 1 indicates that
simple sampling is more cost-effective than WVISS, and vice versa:
This graph shows that efficiency is difficult to predict and depends strongly
on the graph and on the sampling strategy chosen.
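To make the notion of design effect concrete, here is a minimal sketch that assumes the usual definition: the variance of the estimates under the studied design divided by the variance under simple random sampling of the same size. The function name and the replicate values are hypothetical; they are not results from the poster.

```python
from statistics import pvariance


def design_effect(design_estimates, srs_estimates):
    """Ratio of the variance under the studied design to the variance
    under simple random sampling with the same sample size."""
    return pvariance(design_estimates) / pvariance(srs_estimates)


# Made-up replicate estimates of the same mean under two designs.
wviss_like = [9.0, 10.0, 11.0]
srs_like = [8.0, 10.0, 12.0]
deff = design_effect(wviss_like, srs_like)
print(deff)  # below 1: the studied design is the more precise one here
```

In a simulation study, each list would hold the estimates obtained over many repeated draws from the same graph.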
3. Weighted vertex-induced snowball sampling
Probabilistic sampling involves selecting n units from a population U of size N
at random, with (inclusion) probability $\pi_k$ for each k ∈ U. Given a variable of
interest taking values $y_k$, an unbiased estimate of its mean can be computed
using the Horvitz-Thompson formula:
$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
where s denotes the sample. The coefficients $w_k = 1/\pi_k$ are called sampling
weights. Maximum precision is obtained when $\pi_k \propto y_k$. In practice, $y_k$ is
unobserved, so inclusion probabilities are computed using some auxiliary
information correlated with $y_k$.
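The unbiasedness of the Horvitz-Thompson estimator can be checked exactly on a tiny population. The sketch below assumes Poisson sampling (each unit selected independently with its own probability), which is one convenient design and not necessarily the one used in the poster; all names and values are hypothetical.

```python
from itertools import combinations


def ht_mean(sample, y, pi):
    """Horvitz-Thompson estimate of the population mean from `sample`."""
    return sum(y[k] / pi[k] for k in sample) / len(y)


def poisson_probability(sample, pi):
    """Probability of drawing exactly `sample` under Poisson sampling."""
    prob = 1.0
    for k, p in pi.items():
        prob *= p if k in sample else 1.0 - p
    return prob


def expected_ht_mean(y, pi):
    """Exact expectation of the estimator, enumerating every possible sample."""
    units = list(y)
    total = 0.0
    for size in range(len(units) + 1):
        for sample in combinations(units, size):
            total += poisson_probability(set(sample), pi) * ht_mean(sample, y, pi)
    return total


# Made-up population with inclusion probabilities proportional to y.
y = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 8.0}
pi = {"a": 0.2, "b": 0.4, "c": 0.6, "d": 0.8}
true_mean = sum(y.values()) / len(y)
print(true_mean, expected_ht_mean(y, pi))
```

Both printed values agree up to rounding, which is the unbiasedness property. Any single sample can still miss the true mean; only the average over all samples matches it.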
One of the simplest methods for graph sampling is uniform vertex-induced
subgraph sampling, which consists in selecting vertices at random with uniform
probabilities ($\pi_k = c$, with c constant), along with every edge that connects two
vertices of the sample. Snowball sampling (unweighted), described in [3], consists
in selecting vertices with uniform probability and then adding all their
neighbors (plus the induced edges) to the sample. It generally uses unweighted
estimates.
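The two baseline schemes can be sketched as follows for an undirected graph stored as an adjacency dictionary. This is a minimal illustration under stated assumptions: vertices are selected by independent Bernoulli draws with probability c, vertex labels are comparable (so each edge is stored once as a sorted pair), and all function names are hypothetical.

```python
import random


def induced_edges(adj, vertices):
    """Edges of the original graph whose two endpoints are both sampled."""
    return {(u, v) for u in vertices for v in adj[u] if v in vertices and u < v}


def bernoulli_vertices(adj, c, rng):
    """Select each vertex independently with the same probability c."""
    return {k for k in adj if rng.random() < c}


def uniform_induced_sample(adj, c, rng):
    """Uniform vertex-induced subgraph: sampled vertices plus induced edges."""
    vertices = bernoulli_vertices(adj, c, rng)
    return vertices, induced_edges(adj, vertices)


def snowball_expand(adj, seeds):
    """Add every neighbor of the seeds, then take the induced edges."""
    vertices = set(seeds)
    for k in seeds:
        vertices |= adj[k]
    return vertices, induced_edges(adj, vertices)


def snowball_sample(adj, c, rng):
    """Unweighted snowball: uniform seeds followed by one expansion step."""
    return snowball_expand(adj, bernoulli_vertices(adj, c, rng))


# Toy path graph 0-1-2-3-4 (made up for the example).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(snowball_expand(path, {2}))
print(uniform_induced_sample(path, 0.5, random.Random(0)))
```

Starting from the single seed 2, the snowball step returns vertices 1, 2 and 3 with the two edges between them, which shows how one expansion step inflates the sample beyond the seeds.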
Weighted vertex-induced snowball sampling (WVISS) works in three
phases. First, a sample of $n_0$ vertices is drawn randomly with a specific strategy
based on external information and/or graph topology. Second, all vertices that
are connected to the vertices selected in the first phase are added to the sample.
Finally, all edges of the initial graph connecting sampled vertices are added,
which finalizes the sample graph. The final size n of the sampled graph
is thus random. The graph on the left illustrates the first and the final steps of the procedure on
an example graph. All WVISS estimates use sampling weights. We show that the
corresponding Horvitz-Thompson weights can be computed in closed form for each vertex k of the
graph, thus providing unbiased estimates for any mean or total of a linear variable:
$$w_k = \frac{1}{1 - \prod_{j \in B_k} (1 - \pi_j)}$$
where $B_k$ is the set of vertices having an edge pointing to vertex k.
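The closed-form weights can be checked by brute force on a small directed graph. The sketch below rests on two assumptions that are not stated in the text: first-phase vertices are drawn independently with probabilities $\pi_j$ (Poisson sampling), and each vertex k is counted as a member of its own $B_k$, so that a first-phase vertex always belongs to the final sample. The third phase (induced edges) is omitted because the weights concern vertices only. All names and numbers are hypothetical.

```python
from itertools import combinations


def in_neighborhoods(adj):
    """B_k: vertices with an edge pointing to k, plus k itself (assumption)."""
    B = {k: {k} for k in adj}
    for j, targets in adj.items():
        for k in targets:
            B[k].add(j)
    return B


def wviss_weights(adj, pi):
    """w_k = 1 / (1 - prod over j in B_k of (1 - pi_j)); needs some pi_j > 0."""
    weights = {}
    for k, Bk in in_neighborhoods(adj).items():
        miss = 1.0
        for j in Bk:
            miss *= 1.0 - pi[j]
        weights[k] = 1.0 / (1.0 - miss)
    return weights


def wviss_vertices(adj, seeds):
    """Phases one and two: the seeds plus every vertex a seed points to."""
    vertices = set(seeds)
    for j in seeds:
        vertices |= adj[j]
    return vertices


def expected_ht_total(adj, pi, y):
    """Exact expectation of the weighted total over all possible seed sets."""
    w = wviss_weights(adj, pi)
    nodes = list(adj)
    expectation = 0.0
    for size in range(len(nodes) + 1):
        for seeds in combinations(nodes, size):
            prob = 1.0
            for j in nodes:
                prob *= pi[j] if j in seeds else 1.0 - pi[j]
            sample = wviss_vertices(adj, seeds)
            expectation += prob * sum(w[k] * y[k] for k in sample)
    return expectation


# Toy directed graph: adj[j] is the set of vertices that j points to.
adj = {0: {1, 2}, 1: {2}, 2: {0}, 3: {2}}
pi = {0: 0.3, 1: 0.5, 2: 0.2, 3: 0.4}
y = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
print(sum(y.values()), expected_ht_total(adj, pi, y))
```

Under these assumptions the denominator of $w_k$ is the probability that at least one vertex of $B_k$ is drawn in the first phase, which is the inclusion probability of k, so the expected weighted total matches the true total up to rounding.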
5. Application to machine learning (recommendations)
[Figure: top-10 accuracy (vertical axis, 0.0 to 0.6) as a function of the sampling fraction (horizontal axis, 0.00 to 0.25) for three methods: uniform induced, unweighted snowball, and WVISS.]
We simulated a graph recommendation problem using a co-purchase graph of items
generated from a forest-fire model (of order N = 8000 and fwprobs = 0.15). The
goal is to predict the next purchases of users. Each user who has just
purchased object i has a hidden preference for object j determined by the
equation:
(1) $\mathrm{preference}_j = \beta_1 \, \mathrm{degree}_j + \beta_2 \, \mathrm{distance}_{i,j}$
The graph on the left shows the accuracy of the prediction of the next 10
purchases (top-10 accuracy) for each sampling algorithm and for several sample
sizes (expressed as a fraction of N), with $\beta_1 = 0.2$ and $\beta_2 = 0.1$. WVISS is
consistently more accurate than the other sampling algorithms. Reaching an
accuracy of 0.5 requires nearly half as many units with WVISS as with
unweighted snowball.
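Equation (1) can be sketched on a toy co-purchase graph as follows. Several points are assumptions made for this example only: the degree of j is its number of neighbors, the distance is the shortest-path hop count from i, unreachable items are ignored, and top-k accuracy is taken as the share of actual next purchases found among the k predictions. The function names and the graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path hop counts from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def preferences(adj, i, beta1, beta2):
    """Equation (1): beta1 * degree_j + beta2 * distance_{i,j} for each j != i."""
    dist = bfs_distances(adj, i)
    return {j: beta1 * len(adj[j]) + beta2 * dist[j] for j in dist if j != i}


def top_k(scores, k):
    """The k best-scored items, ties broken by item label."""
    return sorted(scores, key=lambda j: (-scores[j], j))[:k]


def top_k_accuracy(predicted, actual):
    """Share of the actual next purchases that appear in the predictions."""
    return len(set(predicted) & set(actual)) / len(actual)


# Toy undirected co-purchase graph (made up for the example).
items = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
scores = preferences(items, 0, beta1=0.2, beta2=0.1)
print(top_k(scores, 2))
```

In a sampling experiment, the same scoring would be run on each sampled subgraph and the resulting predictions compared with purchases simulated on the full graph.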
6. References
[1] Eric D. Kolaczyk. Statistical Analysis of Network Data. Springer, 2009.
[2] Seth A. Myers and Jure Leskovec. The bursty dynamics of the Twitter information network. In Proceedings of the 23rd International Conference on World Wide Web, pages 913–924. ACM, 2014.
[3] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631–636. ACM, 2006.
