Network sampling and applications to big data
and machine learning
Antoine Rebecq
Université Paris Nanterre, 200 av. de la République, 92000 Nanterre, FRANCE
Shopify Montréal, 490 rue de la Gauchetière O, Montréal H2Z0B3, CANADA
1. Introduction
Applications of graph (network) data are increasingly popular in the tech industry. Graph data is used for many purposes, including statistical analysis and machine
learning, but processing it is often very costly and poses challenges when scaling. Here we investigate different graph sampling algorithms, describe
their efficiency, and showcase their application using a basic recommender system problem. We propose weighted vertex-induced snowball sampling
(WVISS), which we found to be more efficient than many competing graph sampling algorithms.
2. Graph topology is very diverse
Real-world graphs often possess difficult-to-model properties ([1]), including the scale-free
property (the degree distribution has a long tail) and the small-world property (path lengths are
short). Since the 1990s, probabilistic models have been created to capture these properties,
such as Barabási and Albert's (scale-free) or Watts and Strogatz's (small-world).
Real-world graphs often combine both properties and are thus challenging to model. The following
graph shows the distribution of the log-degree and of path lengths for the Twitter graph ([2]),
which exhibit the scale-free and small-world properties, respectively.
4. General WVISS efficiency
We measured the precision of WVISS-based estimates in simulations on graphs generated from
probabilistic models and on real-world networks. The following graph shows the design effect of
WVISS based on local clustering, that is, the variance of the resulting estimates relative to
that of a design with uniform probabilities and the same sample size. A design effect greater
than 1 indicates that simple sampling is more cost-effective than WVISS, and vice versa.
This graph shows that efficiency is difficult to predict and depends very much on the graph
and the sampling strategy chosen.
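As a minimal sketch of how such a design effect could be computed from simulation output, assuming it is taken as the usual ratio of estimator variances across replications (the function name and the replicate values below are hypothetical, not taken from the poster):

```python
from statistics import pvariance


def design_effect(design_estimates, uniform_estimates):
    """Variance of the design's estimates divided by that of the uniform design."""
    return pvariance(design_estimates) / pvariance(uniform_estimates)


# Hypothetical replicate estimates of the same mean under two designs.
wviss_runs = [9.8, 10.1, 10.0, 9.9, 10.2]
uniform_runs = [9.5, 10.4, 10.1, 9.6, 10.4]
print("design effect:", design_effect(wviss_runs, uniform_runs))
```

A value below 1 would favor the tested design over uniform sampling at equal sample size.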
3. Weighted vertex-induced snowball sampling
Probabilistic sampling involves selecting n units of a population of size N
at random with (inclusion) probability πk for each k ∈ U. Given a variable of
interest taking values yk, an unbiased estimate of its mean can be computed
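using the Horvitz-Thompson formula, where the sum runs over the sampled units s:
$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
The coefficients $w_k = 1/\pi_k$ are called sampling weights. Maximum precision is obtained when $\pi_k \propto y_k$. In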
practice, yk is unobserved so inclusion probabilities are computed using some
auxiliary information correlated to yk.
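The estimator above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's code: units are drawn independently (Poisson sampling), so the sample size is random, and the population, the values y_k and the helper names are all hypothetical.

```python
import random


def horvitz_thompson_mean(sample, y, pi, N):
    """Horvitz-Thompson estimate of the population mean.

    sample: ids of the sampled units
    y:      mapping id -> value of the variable of interest
    pi:     mapping id -> inclusion probability (must be > 0)
    N:      population size
    """
    return sum(y[k] / pi[k] for k in sample) / N


def poisson_sample(pi, rng):
    """Assumed design: keep each unit independently with probability pi[k]."""
    return [k for k, p in pi.items() if rng.random() < p]


# Hypothetical population: y_k = k + 1 for k = 0..49, with pi_k proportional to y_k.
N = 50
y = {k: float(k + 1) for k in range(N)}
expected_size = 10
total = sum(y.values())
pi = {k: expected_size * y[k] / total for k in range(N)}

rng = random.Random(0)
estimates = [horvitz_thompson_mean(poisson_sample(pi, rng), y, pi, N)
             for _ in range(10000)]
print("true mean:", total / N)
print("average HT estimate:", sum(estimates) / len(estimates))
```

Averaged over many replications, the estimate should settle close to the true mean, which is what unbiasedness means here.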
One of the simplest methods for graph sampling is uniform vertex-induced
subgraph sampling, which consists in selecting vertices at random with uniform
probabilities (πk = c, with c constant), along with any edge that connects two vertices
of the sample. Unweighted snowball sampling, described in [3], consists
in selecting vertices with uniform probability and then adding all their neighbors
(plus the induced edges) to the sample. It generally uses unweighted
estimates.
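Both baseline methods can be sketched as follows, assuming an undirected graph stored as a dictionary mapping each vertex to the set of its neighbors, independent vertex draws with probability c, and a single snowball wave. The function names are hypothetical.

```python
import random


def induced_subgraph(adj, vertices):
    """Keep the given vertices and only the edges joining two of them."""
    keep = set(vertices)
    return {v: adj[v] & keep for v in keep}


def uniform_induced_sample(adj, c, rng):
    """Select each vertex independently with the same probability c."""
    chosen = {v for v in adj if rng.random() < c}
    return induced_subgraph(adj, chosen)


def snowball_sample(adj, c, rng):
    """One wave: uniform seeds, then every neighbor of a seed is added."""
    seeds = {v for v in adj if rng.random() < c}
    wave = set(seeds)
    for v in seeds:
        wave |= adj[v]
    return seeds, induced_subgraph(adj, wave)


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
rng = random.Random(42)
seeds, sub = snowball_sample(adj, 0.3, rng)
print("seeds:", sorted(seeds), "sampled vertices:", sorted(sub))
```

For the same c, the snowball sample is usually larger than the uniform one, since each seed brings in its whole neighborhood.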
Weighted vertex-induced snowball sampling (WVISS) works in three
phases. First, a sample of n0 vertices is drawn randomly with a specific strategy
based on external information and/or graph topology. Second, all vertices that
are connected to the vertices selected in the first phase are added to the sample.
Finally, all edges connecting sampled vertices in the initial graph are added,
which finalizes the sample graph. The final sample size n of the sampled graph
is thus random. The graph on the left illustrates the first and the final step of the procedure on
an example graph. All WVISS estimates use sampling weights. We show that the
corresponding Horvitz-Thompson weights can be computed in closed form for each vertex k of the
graph, thus providing unbiased estimates for any mean or total of a linear variable:
$$w_k = \frac{1}{1 - \prod_{j \in B_k} (1 - \pi_j)}$$
where $B_k$ is the set of vertices having an edge pointing to vertex k.
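The closed-form weights can be illustrated with a short sketch. It rests on two assumptions that are not stated on the poster: first-phase vertices are drawn independently with probability π_j, and in the toy example B_k also contains k itself, since drawing k in the first phase places k in the sample. The graph, the values y_k and the function names are hypothetical.

```python
import math
import random


def wviss_weights(B, pi):
    """w_k = 1 / (1 - prod_{j in B_k} (1 - pi_j)) for every vertex k."""
    weights = {}
    for k, sources in B.items():
        # Probability that no vertex of B_k is drawn in the first phase.
        miss = math.prod(1.0 - pi[j] for j in sources)
        # A vertex that can never be reached gets an infinite weight.
        weights[k] = 1.0 / (1.0 - miss) if miss < 1.0 else math.inf
    return weights


def wviss_sample(adj, pi, rng):
    """Assumed first phase: independent draws; then add all neighbors."""
    seeds = {v for v in adj if rng.random() < pi[v]}
    sample = set(seeds)
    for v in seeds:
        sample |= adj[v]
    return sample


# Hypothetical undirected toy graph and variable of interest.
adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
y = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
pi = {v: 0.5 for v in adj}

# Assumption: B_k also contains k, because drawing k itself puts k in the sample.
B = {k: adj[k] | {k} for k in adj}
w = wviss_weights(B, pi)

rng = random.Random(1)
runs = 20000
avg_total = sum(
    sum(w[k] * y[k] for k in wviss_sample(adj, pi, rng)) for _ in range(runs)
) / runs
print("true total:", sum(y.values()), "average weighted estimate:", avg_total)
```

Under these assumptions the weighted total, averaged over many replications, should approach the true total of y.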
5. Application to machine learning (recommendations)
[Figure: top-10 accuracy as a function of the sampling fraction (0.00 to 0.25) for three methods: uniform induced, unweighted snowball, and WVISS.]
We simulated a graph recommendation problem using a co-purchase graph of items generated
from a forest-fire model (of order N = 8000 and fwprobs = 0.15). The goal is to predict the
next purchases of users. Each user who just purchased object i has a hidden preference for
object j determined by the equation:
$$\text{preference}_j = \beta_1 \, \text{degree}_j + \beta_2 \, \text{distance}_{i,j} \qquad (1)$$
The graph on the left shows the accuracy of the prediction of the next 10 purchases (top-10
accuracy) for each sampling algorithm at several sample sizes (expressed as a fraction of
N), with β1 = 0.2 and β2 = 0.1. WVISS is consistently more accurate than the other sampling
algorithms. Reaching an accuracy of 0.5 requires nearly half as many units with WVISS
as with unweighted snowball sampling.
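Equation (1) can be evaluated on a small graph as in the sketch below. It assumes that distance means shortest-path length in an undirected graph, that unreachable items are simply left out, and that ties are broken by item id. None of these choices is specified on the poster, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def preference_scores(adj, i, beta1=0.2, beta2=0.1):
    """Equation (1) for every item j reachable from the purchased item i."""
    dist = bfs_distances(adj, i)
    return {j: beta1 * len(adj[j]) + beta2 * d for j, d in dist.items() if j != i}


def top_items(scores, n=10):
    """Items with the highest preference, ties broken by item id."""
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


# Hypothetical co-purchase graph with four items.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
scores = preference_scores(adj, 0)
print(top_items(scores))
```

Running the same ranking on a sampled subgraph and comparing it with the ranking on the full graph is one way a top-10 accuracy could be measured.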
6. References
[1] Eric D Kolaczyk. Statistical analysis of network data. Springer, 2009.
[2] Seth A Myers and Jure Leskovec. The bursty dynamics of the twitter information network. In
Proceedings of the 23rd international conference on World wide web, pages 913–924. ACM, 2014.
[3] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636.
ACM, 2006.
