This document summarizes a presentation by Nesreen K. Ahmed on graph sampling techniques. It reviews previous work on sampling large graphs to estimate properties such as triangle counts: existing methods either require multiple passes over the data or make assumptions about the order of the graph stream. The presentation introduces a new single-pass Graph Priority Sampling framework that produces unbiased estimates from a fixed-size sample. It assigns each edge a weight and a priority, sampling edges in proportion to their contribution to the graph structures of interest. Estimates can be updated incrementally during the stream or computed retrospectively after it ends. The framework is evaluated on real-world graphs with billions of edges, estimating triangle counts, wedge counts, and clustering coefficients with low variance.
Graph Sample and Hold: A Framework for Big Graph Analytics (Nesreen K. Ahmed)
Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. We use a Horvitz-Thompson construction in conjunction with a scheme that samples arriving edges without adjacencies to previously sampled edges with probability p and holds edges with adjacencies with probability q. Our sample-and-hold framework facilitates the accurate estimation of subgraph patterns by enabling the dependence of the sampling process to vary based on previous history. Within our framework, we show how to produce statistically unbiased estimators for various graph properties from the sample. Given that the graph analytics will run on a sample instead of the whole population, the runtime complexity is kept under control. Moreover, given that the estimators are unbiased, the approximation error is also kept under control.
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization (Fabian Pedregosa)
Short presentation of the paper "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization"
https://arxiv.org/abs/1707.06468
Introduction of the "TrailBlazer" algorithm (Katsuki Ohto)
Slides introducing the paper "Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning". Presented at the NIPS 2016 paper-reading meetup @ PFN (2017/1/19): https://connpass.com/event/47580/
Improving Variational Inference with Inverse Autoregressive Flow (Tatsuya Shirakawa)
These slides were created for a NIPS 2016 study meetup.
IAF and other related research are briefly explained.
paper:
Diederik P. Kingma et al., "Improving Variational Inference with Inverse Autoregressive Flow", 2016
https://papers.nips.cc/paper/6581-improving-variational-autoencoders-with-inverse-autoregressive-flow
2014-06-20 Multinomial Logistic Regression with Apache Spark (DB Tsai)
Logistic regression can be used not only for modeling binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will walk through the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications in document classification and computational linguistics are of this type. He will talk about how to address this problem by using an L-BFGS optimizer instead of a Newton optimizer.
Bio:
DB Tsai is a machine learning engineer at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
Incorporating context-dependent energy into the pedestrian dynamic scheduling model with GPS data (Yuki Oyama)
Oyama, Y., Hato, E. (2015) Incorporating context-dependent energy into the pedestrian dynamic scheduling model with GPS data. The 14th International Conference on Travel Behaviour research (IATBR), Windsor, England.
Graphs are the natural data structure to represent relations. Graph algorithms show irregular memory access patterns, which causes distributed-memory parallel graph algorithms to do more communication than computation. The more work an algorithm generates, the more communication it needs to do. The amount of work can be reduced with frequent synchronization. However, the overhead of frequent synchronization reduces the performance of distributed-memory parallel graph algorithms. The Abstract Graph Machine (AGM) is a model that can control the amount of synchronization and the amount of work generated by an algorithm.
"Scalable Link Discovery for Modern Data-Driven Applications" as presented in the 15th International Semantic Web Conference ISWC, Doctoral Consortium, October 18th, 2016, held in Kobe, Japan
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Implementation of Thorup's Linear Time Algorithm for Undirected Single-Source Shortest Paths (Nick Pruehs)
Mikkel Thorup found the first deterministic algorithm to solve the classic single-source shortest paths problem for undirected graphs with positive integer weights in linear time and space. The algorithm requires a hierarchical bucketing structure for identifying the order in which the vertices have to be visited without breaking this time bound, thus avoiding the sorting bottleneck of the algorithm proposed by Dijkstra in 1959.
Leveraging Multiple GPUs and CPUs for Graphlet Counting in Large Networks (Ryan Rossi)
Massively parallel architectures such as the GPU are becoming increasingly important due to the recent proliferation of data. In this paper, we propose a key class of hybrid parallel graphlet algorithms that leverage multiple CPUs and GPUs simultaneously for computing k-vertex induced subgraph statistics (called graphlets). In addition to the hybrid multi-core CPU-GPU framework, we also investigate single-GPU methods (using multiple cores) and multi-GPU methods that leverage all available GPUs simultaneously for computing induced subgraph statistics. Both methods leverage GPU devices only, whereas the hybrid multi-core CPU-GPU framework leverages all available multi-core CPUs and multiple GPUs for computing graphlets in large networks. Compared to recent approaches, our methods are orders of magnitude faster, while also more cost-effective, enjoying superior performance per capita and per watt. In particular, the methods are up to 300+ times faster than a recent state-of-the-art method. To the best of our knowledge, this is the first work to leverage multiple CPUs and GPUs simultaneously for computing induced subgraph statistics.
Initial Graphulo Graph Analytics Expressed in GraphBLAS:
GraphBLAS is an effort to define standard building blocks for graph algorithms in the language of linear algebra. Graphulo is a project to implement the GraphBLAS using Accumulo.
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Modeling and Roll, Pitch and Yaw Simulation of a Quadrotor (Oka Danil)
In this paper, we developed a quadrotor prototype and propose how to model it and conduct simulations to investigate the effect of roll, pitch and yaw inputs on the outputs of the φ, θ and ψ angles of the quadrotor. The Euler-Newton formalism is used to model the dynamic system. The simulation results show that the φ angle is mostly determined by the roll, the θ angle is mostly determined by the pitch, and the ψ angle is determined by the yaw.
[Paper reading] L-Shapley and C-Shapley: Efficient Model Interpretation for Structured Data (Daiki Tanaka)
Paper at ICLR 2019: "L-Shapley and C-Shapley: Efficient Model Interpretation for Structured Data"
OpenReview link: https://openreview.net/forum?id=S1E3Ko09F7
We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
- Proceedings: https://arxiv.org/abs/1804.03065
Mining Large Dynamic Graphs and Tensors (NAVER Engineering)
Speaker: Kijung Shin (PhD student, CMU)
Date: January 2018
Graph data can easily be found anywhere: the web, social media, review data, and more. Much of this data is terabyte-scale or larger and constantly changing. It also contains rich side information, and is therefore represented as tensors, i.e., multidimensional matrices.
This talk introduces algorithms for analyzing the structure of such large dynamic graph and tensor data and for detecting anomalies. In particular, it focuses on techniques for accurately estimating the number of triangles in a graph and for detecting abnormally dense regions.
The algorithms use sampling and approximation techniques and are designed for distributed and streaming environments. They can be applied to detecting fake influencers on social media, fake reviews, network attacks, and more.
In order to be able to visualize the data, or simply to speed up the learning process without losing the important features, we apply dimensionality reduction methods.
We will talk about two methods: PCA and manifold learning.
[Notebook](https://colab.research.google.com/drive/1_ksjf1K49dUA8XtyDGoL5V3JEajHvFHb)
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps reduce duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
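As a concrete illustration of the first technique, skipping computation on already-converged vertices, here is a minimal Python sketch (the function name, data layout, and tolerance are my own assumptions, not from the report; freezing a vertex once its rank stabilizes is a heuristic, since upstream changes could in principle move it again):

```python
def pagerank_skip_converged(adj, d=0.85, tol=1e-10, max_iter=100):
    """PageRank that skips rank updates for vertices whose rank has converged.

    adj : dict mapping vertex -> list of out-neighbors (no dangling vertices).
    Converged vertices still contribute rank to their out-neighbors;
    only their own update is skipped.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    converged = set()
    for _ in range(max_iter):
        # contributions pushed along out-edges by every vertex
        incoming = {v: 0.0 for v in adj}
        for u, outs in adj.items():
            share = rank[u] / len(outs)
            for v in outs:
                incoming[v] += share
        changed = False
        for v in adj:
            if v in converged:
                continue  # skip vertices that have already converged
            new = (1 - d) / n + d * incoming[v]
            if abs(new - rank[v]) < tol:
                converged.add(v)
            else:
                changed = True
            rank[v] = new
        if not changed:
            break
    return rank
```

On a symmetric graph such as a 3-cycle, every vertex converges to 1/n in one pass and the loop exits early, which is exactly the saving the report describes.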
1. Joint work with:
Nick Duffield - Texas A&M University
Ted Willke - Intel Labs
Ryan Rossi - PARC Research
VLDB'17, Germany
August 31st, 2017
Nesreen K. Ahmed
Research Scientist, Intel Labs
2. [Figure: examples of real-world complex networks]
Social network
Human Disease Network [Barabasi 2007]
Food Web [2007]
Terrorist Network [Krebs 2002]
Internet (AS) [2005]
Gene Regulatory Network [Decourty 2008]
Protein Interactions [breast cancer]
Political blogs
Power grid
4. Studying and analyzing complex networks is a challenging and computationally intensive task
Ø Today's networks are dynamic/streaming over time
- e.g., Twitter streams, email communications
Ø Today's networks are massive in size
- e.g., online social networks have billions of users
5. Studying and analyzing complex networks is a challenging and computationally intensive task
Ø Today's networks are dynamic/streaming over time
- e.g., Twitter streams, email communications
Ø Today's networks are massive in size
- e.g., online social networks have billions of users
Due to these challenges, we usually need to sample
Statistical Sampling: Graph G → Sample S (e.g. Uniform Random Sampling)
6. Given a large graph G represented as a stream of edges e1, e2, e3, ...
We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
7. Given a large graph G represented as a stream of edges e1, e2, e3, ...
We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
8. Given a large graph G represented as a stream of edges e1, e2, e3, ...
We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
No. Wedges, No. Triangles, frequent connected subsets of edges
9. Given a large graph G represented as a stream of edges e1, e2, e3, ...
We show how to efficiently sample from G while limiting memory space to calculate unbiased estimates of various graph properties
No. Wedges, No. Triangles, frequent connected subsets of edges, transitivity
10. § Random Sampling
• Uniform random sampling – [Tsourakakis et al. KDD'09]
— Graph sparsification with probability p
— Chance of sampling a subgraph (e.g., triangle) is very low
— Estimates suffer from high variance
• Wedge Sampling – [Seshadhri et al. SDM'13]
— Sample vertices, then sample pairs of incident edges (wedges)
— Output the estimate of the closed wedges (triangles)
11. The random sampling approaches above assume we have access to the full graph
Not a good fit for massive streaming graphs
12. § Assume specific order of the stream – [Buriol et al. 2006]
• Incidence stream model – vertex neighbors arrive together in the stream
§ Use multiple passes over the stream – [Becchetti et al. KDD'08]
§ Single-pass algorithms for arbitrary-ordered graph streams
13. § Single-pass algorithms for arbitrary-ordered graph streams
• Streaming-Triangles – [Jha et al. KDD'13]
— Sample edges using reservoir sampling, then sample pairs of incident edges (wedges), and finally scan for closed wedges (triangles)
• Neighborhood Sampling – [Pavan et al. VLDB'13]
— Sampling vectors of wedge estimators, scan the stream for closed wedges (triangles)
• TRIEST – [De Stefani et al. KDD'16]
— Uses standard reservoir sampling to maintain the edge sample
• MASCOT – [Lim et al. KDD'15]
— Independent edge sampling with probability p
• Graph Sample & Hold – [Ahmed et al. KDD'14]
— Conditionally independent edge sampling
14. Summary of previous work:
• Sampling designs for specific graph properties (triangles)
— Not generally applicable to other properties
• Uniform-based sampling
— Obtains a variable-size sample
We propose a generic unbiased sampling framework: Graph Priority Sampling
• Weight-sensitive
• Fixed-size sample
• Single-pass
• Applicable for general graph properties
• Uses topological information that we wish to estimate as auxiliary variables
• Variance-optimal sampling (cost optimization approach)
16. Graph Priority Sampling Framework GPS(m)
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream K̂; stored state m = O(|K̂|)
For each edge k:
• Generate a random number u(k) ∼ Uni(0, 1]
• Compute edge weight w(k) = W(k, K̂)
• Compute edge priority r(k) = w(k)/u(k)
• K̂ = K̂ ∪ {k}
17. Graph Priority Sampling Framework GPS(m)
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream K̂; stored state m = O(|K̂|)
For each edge k:
• Find the edge with lowest priority: k* = arg min_{k' ∈ K̂} r(k')
• Update the sample threshold: z* = max{z*, r(k*)}
• Remove the lowest-priority edge: K̂ = K̂ \ {k*}
Use a priority queue with O(log m) updates
18. § We use edge weights to express the role of the arriving edge in the sampled graph: w(k) = W(k, K̂)
• e.g., no. of subgraphs completed by the arriving edge, and/or other auxiliary variables
§ Computational feasibility
• Efficient implementation by using a priority queue
• Implemented as a min-heap with O(log m) insertion/deletion
• O(1) access to the edge with minimum priority
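The GPS(m) loop on the last two slides can be sketched in a few lines of Python. This is a minimal illustration under my own naming assumptions (`graph_priority_sample`, `weight_fn` standing in for the paper's W(k, K̂)), not the authors' implementation:

```python
import heapq
import random

def graph_priority_sample(edge_stream, m, weight_fn):
    """Single-pass, fixed-size Graph Priority Sampling (sketch).

    edge_stream -- iterable of edges k
    m           -- number of edges to keep in the sample
    weight_fn   -- W(k, sample): weight of arriving edge k given the current sample
    Returns the sampled edge set K̂ and the final threshold z*.
    """
    heap = []        # min-heap of (priority r(k), edge k)
    sample = set()   # current sample K̂
    z_star = 0.0     # sample threshold z*

    for k in edge_stream:
        u = random.uniform(1e-12, 1.0)   # u(k) ~ Uni(0, 1] (approximately)
        w = weight_fn(k, sample)         # w(k) = W(k, K̂)
        r = w / u                        # priority r(k) = w(k) / u(k)
        heapq.heappush(heap, (r, k))
        sample.add(k)
        if len(sample) > m:              # evict the lowest-priority edge k*
            r_min, k_min = heapq.heappop(heap)
            z_star = max(z_star, r_min)  # z* = max{z*, r(k*)}
            sample.discard(k_min)
    return sample, z_star
```

With the min-heap, each arrival costs O(log m), matching the priority-queue discussion on the slide; the threshold z* recorded here is what the Horvitz-Thompson edge estimators divide by via min{1, w_i/z*}.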
19. Edge Estimation
For each edge i, we construct a sequence of edge estimators Ŝ_{i,t}:
Ŝ_{i,t} = I(i ∈ K̂_t) / min{1, w_i/z*}
where K̂_t is the sample at time t, and the Ŝ_{i,t} are unbiased estimators of the corresponding edge.
We achieve unbiasedness by establishing that the sequence is a Martingale (Theorem 1):
E[Ŝ_{i,t}] = S_{i,t}
20. Subgraph Estimation
For each subgraph J ⊂ [t], we define the sequence of subgraph estimators as
Ŝ_{J,t} = ∏_{i ∈ J} Ŝ_{i,t}
We prove the sequence is a Martingale (Theorem 2):
E[Ŝ_{J,t}] = S_{J,t}
21. Subgraph Counting
For any set 𝒥 of subgraphs of G,
N̂_t(𝒥) = ∑_{J ∈ 𝒥 : J ⊂ K̂_t} Ŝ_{J,t}
is an unbiased estimator of N_t(𝒥) = |𝒥_t| (Theorem 2)
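For the special case where 𝒥 is the set of triangles, the post-stream count estimator can be sketched as follows. This is a hypothetical helper, not the paper's code: `weights` is assumed to hold the weight w_i recorded for each edge at sampling time, and `z_star` is the final threshold from the sampling pass:

```python
def estimate_triangle_count(sample_edges, weights, z_star):
    """N̂ = sum over triangles J ⊆ K̂ of prod_{i∈J} 1 / min{1, w_i / z*}.

    sample_edges -- set of undirected edges (u, v) stored with u < v
    weights      -- dict: edge -> weight w_i recorded when the edge arrived
    z_star       -- final sample threshold z*
    """
    # adjacency over the sampled edges only
    adj = {}
    for (u, v) in sample_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def inv_prob(e):
        # inverse Horvitz-Thompson inclusion probability of edge e
        return 1.0 / min(1.0, weights[e] / z_star)

    total = 0.0
    for (u, v) in sample_edges:
        for c in adj[u] & adj[v]:   # common neighbor c closes a triangle
            if c > v:               # u < v < c: count each triangle exactly once
                total += (inv_prob((u, v))
                          * inv_prob((min(u, c), max(u, c)))
                          * inv_prob((min(v, c), max(v, c))))
    return total
```

When every sampled edge has w_i ≥ z* the inclusion probabilities are 1 and the estimator reduces to an exact count of the sampled triangles, which is a useful sanity check.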
22. § We provide a cost minimization approach
• inspired by IPPS sampling [Duffield et al. 2005]
§ By minimizing the conditional variance of the increment incurred by the arriving edge in N_t(𝒥)
§ How should the ranks r_{i,t} be distributed in order to minimize the variance of the unbiased estimator of N_t(𝒥)?
23. § Post-stream Estimation
• enables retrospective subgraph queries
• after any number t of edge arrivals have taken place, we can compute an unbiased estimator for any subgraph
§ In-stream Estimation
• we can take "snapshots" of estimates of specific sampled subgraphs at arbitrary times during the stream
• still unbiased!
• lightweight online/incremental update of unbiased estimates of subgraph counts
• same sampling procedure
• uses stopped Martingales
24. Graph Priority Sampling Framework GPS(m), with in-stream estimation
Input: edge stream k1, k2, ..., k, ...
Output: sampled edge stream K̂; stored state m = O(|K̂|)
For each edge k:
• Compute edge priority r(k) = w(k)/u(k)
• Update the sample
• Update unbiased estimates of subgraph counts
25. In-stream Estimation
We define a snapshot as an edge subset J, with a family of stopping times T such that T = {T_j : j ∈ J}
We prove the sequence is a stopped Martingale (Theorem 4):
Ŝ^T_{J,t} = ∏_{j ∈ J} Ŝ^{T_j}_{j,t} = ∏_{j ∈ J} Ŝ_{j,min{T_j,t}}
E[Ŝ^T_{J,t}] = S_{J,t}
26. § We use GPS for the estimation of
• Triangle counts
• Wedge counts
• Global clustering coefficient
• And their unbiased variance (Theorem 3 in the paper)
§ Weight function: W(k, K̂) = 9 · Δ̂(k) + 1, where Δ̂(k) is the number of triangles completed by edge k whose other edges are in K̂
§ Used a large set of graphs from a variety of domains (social, web, tech, etc.) - data is available on http://networkrepository.com/
— Up to 49B edges
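The triangle-based weight function can be sketched directly. This is a hypothetical helper (the name `triangle_weight` and the representation of K̂ as a set of vertex pairs are my own assumptions):

```python
def triangle_weight(edge, sampled_edges):
    """W(k, K̂) = 9 * Δ̂(k) + 1, where Δ̂(k) counts the triangles
    completed by edge k whose other two edges are already in K̂."""
    u, v = edge

    def neighbors(x):
        # neighbors of x in the sampled graph K̂
        return ({b for (a, b) in sampled_edges if a == x}
                | {a for (a, b) in sampled_edges if b == x})

    # each common sampled neighbor of u and v closes one triangle with k
    return 9 * len(neighbors(u) & neighbors(v)) + 1
```

The "+1" keeps every edge sampleable even when it closes no triangle; per the later slides, this triangle-aware weighting gave markedly lower error than wedge-based or uniform weights.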
27. - GPS accurately estimates various properties simultaneously
- Consistent performance across graphs from various domains
- A key advantage: GPS in-stream estimation has smaller variance and tight confidence bounds
28. Results for triangle counts
Using massive real-world and synthetic graphs of up to 49B edges
GPS is shown to be accurate with <0.01 error
Sample size = 1M edges, in-stream estimation
95% confidence intervals
31. [Figure: GPS in-stream estimates of triangle counts and global clustering coefficient over time for soc-orkut; x-axis: stream size at time t (|Kt|); curves: Actual, Estimate, Upper Bound, Lower Bound]
GPS in-stream estimates over time
Sample size = 80K edges
95% confidence intervals
32. [Figure: in-stream estimated-to-actual ratios, all within 0.994-1.006, for ca-hollywood-2009, com-amazon, higgs-social-network, soc-flickr, soc-youtube-snap, socfb-Indiana69, socfb-Penn94, socfb-Texas84, socfb-UF21, tech-as-skitter, web-BerkStan, web-google]
GPS In-stream Estimation, sample size 100K edges
GPS accurately estimates both triangle and wedge counts simultaneously with a single sample
33.
34. We observe accurate results with no significant difference in error between the ordering schemes
35. § We used three schemes for weighting edges during sampling
§ Goal: estimate triangle counts for the Friendster social network with sample size = 1M (0.1% of the graph)
1. triangle-based weights (3% relative error)
2. wedge-based weights (25% relative error)
3. uniform weights for all incoming edges (43% relative error)
- this is equivalent to simple random sampling
The estimator variance was 3.8x higher using wedge-based weights, and 6.2x higher using uniform weights, compared to triangle-based weights.
36. § A sample is representative if graph properties of interest can be estimated with a known degree of accuracy
§ We proposed a generic framework, Graph Priority Sampling (GPS)
- GPS is an efficient single-pass streaming framework
- GPS selects a representative sample and computes unbiased estimates of counts of connected subsets of edges (e.g., triangles, wedges, ...)
- Theoretical properties of GPS are supported by empirical analysis
§ GPS admits generalizations by allowing the dependence of the sampling process as a function of the stored state and/or auxiliary variables
§ GPS is a variance-minimizing sampling approach
§ GPS has a relative estimation error < 1%