TRIÈST: Approximating Triangle Counts
in Fully-Dynamic Graph Edge Streams
with Fixed Memory
Matteo Riondato – Labs, Two Si...
Where are the theorems?
We give complete analysis of unbiasedness, variance, and novel concentration bounds;
The events “e...
Ok, but can I show you something?
To show the exact variance of the TRIÈST-BASE estimator $\Delta_{G_S}$:
1) Express variance as sum o...
What is wrong with TRIÈST-BASE?
Weaknesses:
1) TRIÈST-BASE uses the exact value of $\Delta_{G_S}$ at time $t$ to estimate $\Delta_{G^{(t)}}$;
Over time, ...
How does TRIÈST-IMPR work?
Memory: M = 8; Time: end of t∗ − 1;
Graph GS = (VS, S):
[Figure: the sample graph on vertices 0 to 5]
Triangle counter λ (= $\Delta_{G_S}$): ...
How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗;
Edge on the stream: (2, 5);
Action: Weighted increment of λ using the ...
How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗ + 1;
Edge on the stream: (2, 4);
Action: Weighted increment of λ using ...
How does TRIÈST-IMPR estimate the number of triangles?
TRIÈST-IMPR returns λ as the unbiased estimate of $\Delta_{G^{(t)}}$.
Corolla...
Where are the theorems?
The order of the updates on the streams affects the probability of “seeing” a triangle;
This furthe...
What about fully-dynamic edge streams?
Handling deletions is hard;
TRIÈST-FD’s approach is inspired by random pairing (Gem...
Where are the experiments?
Implementation: C++. Available from http://bit.ly/triestkdd
Graphs: Last.fm, Patent-Cit, Patent...
How does TRIÈST-IMPR perform?
Yahoo! graph with 1.2 billion edges (computing exact ∆G is infeasible);
Space M = 1 million ...
How does TRIÈST-IMPR perform compared to other methods?
Last.fm graph (40 million edges, 1 billion triangles);
Space M = 1...
How does TRIÈST-FD perform?
[Figure: plot of TRIÈST-FD results]
How scalable is TRIÈST-FD?
We measured the average time to handle an update on the stream;
[Figure: average update time per graph (patent-cit, ...), with axis ticks from 1 to 10000 in powers of ten]
What didn’t I tell you?
The Goods:
Concentration results (the one for TRIÈST-BASE is very elegant);
Theorems for TRIÈST-FD...
What did I talk about?
TRIÈST: three algorithms for triangle counts estimation in fully-dynamic edge streams;
• Uses a fixe...

The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.

TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size

  1. TRIÈST: Approximating Triangle Counts in Fully-Dynamic Graph Edge Streams with Fixed Memory
     Matteo Riondato, Labs, Two Sigma Investments
     CMU DB Group, October 24, 2016
  2. Who am I?
     Matteo Riondato
     • Working at Labs, Two Sigma Investments (Research Scientist) and at the CS Dept., Brown U. (Visiting Asst. Prof.);
     • Doing research in algorithmic data science (used to be data mining, but somehow we forgot about algorithms...);
       algorithmic data science = (theory × practice)^(theory × practice)
     • Tweeting @teorionda;
     • “Living” at http://matteo.rionda.to.
  3. What am I going to talk about?
     TRIÈST: a suite of algorithms for approximately counting triangles in fully-dynamic edge streams, using a fixed amount of storage/space/memory.
     Joint work with:
     • Lorenzo De Stefani (Brown);
     • Alessandro Epasto (Google Research);
     • Eli Upfal (Brown).
     Best student paper award at ACM KDD’16.
     Journal version under submission to ACM TKDD, available from http://bit.ly/triestkdd.

     [Figure: first page of the paper, transcribed below]
     TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size
     Lorenzo De Stefani (Brown University, Providence, RI, USA; lorenzo@cs.brown.edu)
     Alessandro Epasto (Google, New York, NY, USA; aepasto@google.com)
     Matteo Riondato (Two Sigma Investments, New York, NY, USA; matteo@twosigma.com)
     Eli Upfal (Brown University, Providence, RI, USA; eli@cs.brown.edu)
     “Ogni lassada xe persa” (proverb from Trieste, Italy)

     ABSTRACT. We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which require hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they use. We analyze the variance of the estimations and show novel concentration bounds for these quantities. Our experimental results on very large graphs demonstrate that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits a small update time.

     1. INTRODUCTION. Exact computation of characteristic quantities of Web-scale networks is often impractical or even infeasible due [...] approximation of these quantities. For efficiency, the algorithms should aim at exploiting the available memory space as much as possible and they should require only one pass over the stream. We introduce TRIÈST, a suite of sampling-based, one-pass algorithms for adversarial fully-dynamic streams to approximate the global number of triangles and the local number of triangles incident to each vertex. Mining local and global triangles is a fundamental primitive with many applications (e.g., community detection [4], topic mining [10], spam/anomaly detection [3, 27], ego-networks mining [12] and protein interaction networks analysis [29]). Many previous works on triangle estimation in streams also employ sampling (see Sect. 3), but they usually require the user to specify in advance an edge sampling probability p that is fixed for the entire stream. This approach presents several significant drawbacks. First, choosing a p that allows to obtain the desired approximation quality requires to know or guess a number of properties of the input (e.g., the size of the stream). Second, a fixed p implies that the sample size grows with the size of the stream, which is problematic when the stream size is not known in advance: if the user [...]
  4. What are triangles?
     Let $G = (V, E)$ be a graph.
     [Figure: example graph on vertices 1 to 8]
     • Triangle: a set of three edges forming a cycle;
     • Global triangle count $\Delta_G$: the number of triangles in $G$; e.g., $\Delta_G = 3$;
     • Local triangle count $\Delta_v$ for $v \in V$: the number of triangles that $v$ “belongs” to; e.g., $\Delta_1 = 2$, $\Delta_5 = 3$, $\Delta_6 = 0$, ...
     • Applications: community/spam/event detection, link prediction/recommendation, prototype for more complex patterns, ...
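The two definitions above can be made concrete with a brute-force counter. This is only a minimal sketch: the function name `count_triangles` and the example edge list are hypothetical (the graph drawn on the slide is not recoverable from the transcript), and the cubic enumeration is chosen for clarity, not speed.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Return (global_count, local_counts) for an undirected simple graph.

    `edges` is an iterable of (u, v) pairs; self-loops are ignored.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    global_count = 0
    local_counts = {v: 0 for v in adj}
    # Check every unordered vertex triple: fine for tiny graphs only.
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            global_count += 1
            for x in (u, v, w):
                local_counts[x] += 1
    return global_count, local_counts


# Hypothetical example graph, NOT the one drawn on the slide.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (4, 5)]
total, per_vertex = count_triangles(edges)
print(total, per_vertex)
```

Each triangle contributes to the local count of exactly three vertices, so the local counts always sum to three times the global count.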
  8. What are fully-dynamic edge streams?
     • Discrete time $t$, starting at $t = 0$ and never ending;
     • At each time step, a new edge update (insertion or deletion) is on the stream:

       Time    ...  t∗         t∗+1       t∗+2       t∗+3       t∗+4       t∗+5       ...
       Stream  ...  +, (1, 2)  +, (3, 2)  +, (1, 3)  −, (3, 2)  +, (1, 5)  +, (4, 5)  ...

     • The order may be fixed in advance by an adversary;
     • $G^{(t)} = (V^{(t)}, E^{(t)})$: the graph induced by the edges inserted and not deleted up to time $t$.
     Example: [Figure: the graph on vertices 0 to 5, redrawn after each update from $G^{(t^*)}$ to $G^{(t^*+5)}$]
     The global and local triangle counts change from $G^{(t)}$ to $G^{(t+1)}$.
     Our goal: at each time $t$, give an estimate of $\Delta_{G^{(t)}}$ and of $\Delta_v$ for each $v \in V^{(t)}$.
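To illustrate how $E^{(t)}$ evolves, the six updates in the stream table can be replayed in order. The sketch below assumes an empty starting edge set, since the edges already present before time t∗ on the slide are not recoverable; the function name `replay` is hypothetical.

```python
def replay(stream, initial_edges=()):
    """Apply (op, (u, v)) updates in order and yield E(t) after each one.

    op is "+" for an insertion and "-" for a deletion. Edges are undirected,
    so each one is stored with its endpoints sorted.
    """
    current = {tuple(sorted(e)) for e in initial_edges}
    for op, (u, v) in stream:
        edge = (min(u, v), max(u, v))
        if op == "+":
            current.add(edge)
        elif op == "-":
            current.discard(edge)
        else:
            raise ValueError(f"unknown update type: {op!r}")
        yield frozenset(current)


# The six updates from the stream table, starting from an empty graph
# (an assumption made for this example).
stream = [
    ("+", (1, 2)), ("+", (3, 2)), ("+", (1, 3)),
    ("-", (3, 2)), ("+", (1, 5)), ("+", (4, 5)),
]
snapshots = list(replay(stream))
for step, snapshot in enumerate(snapshots):
    print(step, sorted(snapshot))
```

Under that assumption, the third update closes the triangle on vertices 1, 2, 3 and the fourth update (the deletion of (3, 2)) destroys it again, which is why the counts must be tracked at every time step.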
  22. Why is working on fully-dynamic edge streams difficult?
      • The stream is infinite: storing all (or a constant fraction of) the edges is impossible;
        → TRIÈST stores a user-specified, fixed number M of edges;
      • There is no end of the stream: post-processing at the end of the stream is impossible;
        → TRIÈST needs no post-processing;
      • Updates arrive continuously: re-running an algorithm from scratch after each update is infeasible;
        → TRIÈST is incremental and one-pass;
      • Triangle counts change continuously: spending a long time on each update to get the exact count is infeasible and illogical;
        → TRIÈST computes high-quality estimates.
      An efficient algorithm for fully-dynamic streams must tackle all these challenges. TRIÈST does.
  24. What is TRIÈST?
      (The local dialect name of Trieste, a city in the North-East of Italy, next to Slovenia.)
      TRIÈST (TRIangles ESTimation): a suite of three algorithms for approximate triangle counting from edge streams:
      • TRIÈST-BASE: baseline algorithm for insertion-only streams;
      • TRIÈST-IMPR: improved algorithm for insertion-only streams, with reduced variance;
      • TRIÈST-FD: algorithm for fully-dynamic streams.
      All three algorithms offer unbiased estimators of the local and global triangle counts.
      We also present a complete analysis of their variance and give concentration bounds.
  25. Aren’t there other algorithms to estimate triangles?
      • There are many algorithms for estimating triangles from data streams;
      • The most recent ones are based on independent edge sampling with a fixed probability;
      • They use an ever-increasing amount of space.
      [Table: prior work compared property by property. Columns: single pass, fixed space, local counts, global counts, fully-dynamic streams. Rows: Becchetti et al. 2010; Kolountzakis et al. 2012; Pavan et al. 2013; Jha et al. 2015; Ahmed et al. 2014; Lim et al. 2015; TRIÈST.]
      • TRIÈST is the first to tackle all the challenges;
      • It is based on reservoir sampling, a well-known non-independent sampling scheme;
      • The analysis is challenging, but the gains are worth the price.
  26. What is the general idea behind TRIÈST?
      Let’s focus on TRIÈST-BASE for now (i.e., insertion-only streams).
      • TRIÈST-BASE maintains a collection $S$ of $M$ edges from the stream;
      • The edges in $S$ induce a graph $G_S = (V_S, S)$;
      • TRIÈST-BASE maintains the exact values of $\Delta_{G_S}$, the number of triangles in $G_S$, and of $\Delta_v^S$, the number of triangles in $G_S$ incident to $v \in V_S$;
      • Maintaining the exact counts $\Delta_{G_S}$ and $\Delta_v^S$, $v \in V^{(t)}$, after each update is fast;
      • Estimates for $\Delta_{G^{(t)}}$ and $\Delta_v$, $v \in V^{(t)}$, are obtained from $\Delta_{G_S}$ and $\Delta_v^S$ by weighting by a probability $\pi_t$ (stay tuned!).
  27. How does TRIÈST-BASE work?
      TRIÈST-BASE uses a random sampling scheme known as reservoir sampling:
      • At any time $t \le M$, deterministically insert the edge currently on the stream into $S$;
      • At any time $t > M$, flip a coin that comes up tails with probability $M/t$;
      • If the outcome is heads, do nothing;
      • If the outcome is tails:
        1) Choose an edge in $S$ uniformly at random and replace it with the edge currently on the stream;
        2) Decrease $\Delta_{G_S}$ and $\Delta_v^S$, $v \in V_S$, by the number of triangles involving the removed edge;
        3) Increase $\Delta_{G_S}$ and $\Delta_v^S$, $v \in V_S$, by the number of triangles involving the inserted edge.
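The steps above can be sketched as a small class. This is an illustrative reading of the slide, not the authors' C++ implementation: the class name `TriestBaseSketch` and its attributes are hypothetical, the stream is assumed to contain no self-loops and no repeated edges, and only the counters for the sample graph are maintained (the weighting that turns them into estimates is left out).

```python
import random
from collections import defaultdict


class TriestBaseSketch:
    """Reservoir sample of at most M edges plus exact triangle counts of the sample."""

    def __init__(self, M, seed=None):
        self.M = M
        self.t = 0                           # number of edges seen so far
        self.edges = []                      # the sample S
        self.adj = defaultdict(set)          # adjacency sets of the sample graph
        self.global_count = 0                # triangles in the sample graph
        self.local_count = defaultdict(int)  # triangles per vertex in the sample graph
        self.rng = random.Random(seed)

    def _update_counters(self, u, v, sign):
        # Every common neighbour w of u and v closes one triangle {u, v, w}.
        for w in self.adj[u] & self.adj[v]:
            self.global_count += sign
            for x in (u, v, w):
                self.local_count[x] += sign

    def _add(self, u, v):
        self._update_counters(u, v, +1)
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.edges.append((u, v))

    def _remove_random(self):
        # Swap a uniformly chosen edge to the end, then pop it in O(1).
        i = self.rng.randrange(len(self.edges))
        self.edges[i], self.edges[-1] = self.edges[-1], self.edges[i]
        u, v = self.edges.pop()
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self._update_counters(u, v, -1)

    def insert(self, u, v):
        self.t += 1
        if self.t <= self.M:
            self._add(u, v)                   # reservoir not full: always keep
        elif self.rng.random() < self.M / self.t:
            self._remove_random()             # "tails": evict a random edge
            self._add(u, v)


sketch = TriestBaseSketch(M=10, seed=0)
for u, v in [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (4, 5)]:
    sketch.insert(u, v)
print(sketch.global_count)  # 2: M exceeds the stream length, so the sample is the whole graph
```

Each update costs one set intersection of two adjacency sets, which is what makes keeping the sample counts exact cheap compared with recounting from scratch.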
  28. 28. Is an example worth a thousand words? Memory: M = 8; Time: end of t∗ − 1; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3
  29. 29. Is an example worth a thousand words? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Coin bias: M/t∗; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3
  30. 30. Is an example worth a thousand words? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Coin bias: M/t∗; Coin flip outcome: tails; Actions: 1) Remove an edge in GS at random (e.g., (0, 1)); 2) Add (2, 5) to GS; 3) Update ∆GS; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3
  33. 33. Is an example worth a thousand words? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Coin bias: M/t∗; Coin flip outcome: tails; Actions: 1) Remove an edge in GS at random (e.g., (0, 1)); 2) Add (2, 5) to GS; 3) Update ∆GS; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3 − 1 + 1 = 3
  34. 34. Is an example worth a thousand words? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Coin bias: M/(t∗ + 1); Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3
  35. 35. Is an example worth a thousand words? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Coin bias: M/(t∗ + 1); Coin flip outcome: heads; Actions: Do nothing; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Global triangle count ∆GS: 3
  39. 39. How does TRIÈST-BASE estimate the number of triangles? Lemma: The set S ⊆ E(t) is chosen uniformly at random among all subsets of E(t) of size M. This does not imply or assume that S is a collection of independently sampled edges. Corollary: The probability that a triangle (a, b, c) of G(t) is in GS at time t is $\pi_t = \binom{t-3}{M-3} / \binom{t}{M}$, because $\binom{t}{M}$ is the number of M-subsets of E(t) (|E(t)| = t) and $\binom{t-3}{M-3}$ is the number of M-subsets of E(t) containing (a, b, c). Hence, TRIÈST-BASE computes the unbiased estimate of ∆G(t) as ∆GS / πt.
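As a small worked check of the corollary, the ratio of binomials simplifies to M(M−1)(M−2) / (t(t−1)(t−2)), which avoids computing huge binomial coefficients. The function names `pi_t` and `estimate_global` below are hypothetical, and the values t = 10, M = 8 are picked only for illustration (the slides never fix a value for t∗).

```python
from fractions import Fraction
from math import comb


def pi_t(t, M):
    """Probability that all three edges of a given triangle are in S at time t."""
    if t <= M:
        return Fraction(1)  # every edge seen so far is still in the sample
    return Fraction(M * (M - 1) * (M - 2), t * (t - 1) * (t - 2))


def estimate_global(sample_triangles, t, M):
    """Estimate of the global triangle count: sample count divided by pi_t."""
    return sample_triangles / pi_t(t, M)


# The closed form agrees with the ratio of binomials from the corollary.
assert pi_t(10, 8) == Fraction(comb(10 - 3, 8 - 3), comb(10, 8))

# With 3 triangles in the sample, M = 8 and (hypothetically) t = 10:
print(estimate_global(3, 10, 8))  # 45/7
```

Exact rational arithmetic via `Fraction` is used here only to make the identity checkable; a streaming implementation would more plausibly use floating point.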
  41. 41. Where are the theorems? We give a complete analysis of unbiasedness, variance, and novel concentration bounds; The events “edge a ∈ S at time t” and “edge b ∈ S at time t” are not independent; This makes the analysis of variance and concentration bounds quite challenging; Theorem (Concentration bound, (ε, δ)-approximation): Let t ≥ 0 and assume |∆(t)| > 0. For any ε, δ ∈ (0, 1), let $\Phi = \sqrt[3]{8\varepsilon^{-2}\,\frac{3h^{(t)}+1}{|\Delta^{(t)}|}\,\ln\left(\frac{(3h^{(t)}+1)e}{\delta}\right)}$. If $M \ge \max\left\{ t\Phi\left(1 + \tfrac{1}{2}\ln^{2/3}(t\Phi)\right),\ 12\varepsilon^{-1} + e^2,\ 25 \right\}$, then $\left|\xi^{(t)}\tau^{(t)} - |\Delta^{(t)}|\right| < \varepsilon|\Delta^{(t)}|$ with probability at least 1 − δ. Proving this was fun: we used results on graph coloring, Poisson approximations, and Chernoff bounds.
  42. 42. Ok, but can I show you something? To derive the exact variance of the TRIÈST-BASE estimator: 1) Express the variance as a sum of covariances over pairs of triangles: $\mathrm{Var}(\Delta_{G_S}) = \sum_{\text{pairs } (a,b)} \mathrm{Cov}(a, b)$; 2) Explicitly compute the covariance formulas: 2.a) For pairs of triangles sharing an edge, compute the probability of 5 edges being in S: $\pi_t \frac{(M-3)(M-4)}{(t-3)(t-4)}$; 2.b) For pairs of triangles not sharing an edge, compute the probability of 6 edges being in S: $\pi_t \frac{(M-3)(M-4)(M-5)}{(t-3)(t-4)(t-5)}$. The variance depends on the actual number of triangles in G(t) and on the number of triangles in G(t) sharing an edge.
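The two probabilities in step 2 follow from the same counting argument as πt: k specific edges all lie in a uniform M-subset of t edges with probability C(t−k, M−k) / C(t, M). A short check of that identity, with hypothetical function names and arbitrarily chosen values of t and M:

```python
from fractions import Fraction
from math import comb


def prob_all_in_sample(k, t, M):
    """Probability that k specific edges all lie in a uniform M-subset of t edges."""
    return Fraction(comb(t - k, M - k), comb(t, M))


def pi_t(t, M):
    return prob_all_in_sample(3, t, M)


def shared_edge_prob(t, M):
    # Two triangles sharing an edge span 5 distinct edges (case 2.a).
    return pi_t(t, M) * Fraction((M - 3) * (M - 4), (t - 3) * (t - 4))


def disjoint_prob(t, M):
    # Two triangles with no common edge span 6 distinct edges (case 2.b).
    return pi_t(t, M) * Fraction((M - 3) * (M - 4) * (M - 5),
                                 (t - 3) * (t - 4) * (t - 5))


for t, M in [(20, 8), (50, 10), (1000, 100)]:
    assert shared_edge_prob(t, M) == prob_all_in_sample(5, t, M)
    assert disjoint_prob(t, M) == prob_all_in_sample(6, t, M)
```

Both products are smaller than πt squared would suggest under independence only in the disjoint case, which is one way to see why the edge-inclusion events cannot be treated as independent.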
  44. 44. What is wrong with TRIÈST-BASE? Weaknesses: 1) TRIÈST-BASE uses the exact value of ∆GS at time t to estimate ∆G(t); Over time, ∆GS may decrease, and so would the estimate, while the true count never decreases: ∆G(t′) ≥ ∆G(t) for any t′ > t! Solution: never decrease the estimate, i.e., use GS only to identify new triangles; 2) TRIÈST-BASE only counts a triangle if all three of its edges are in S, but if two edges are in S and the third one is on the stream right now, we may infer that the triangle exists, so we should count it; Solution: first increment the counters, then decide whether to insert the edge into S; TRIÈST-IMPR addresses both weaknesses, resulting in estimates with lower variance.
  45. 45. How does TRIÈST-IMPR work? Memory: M = 8; Time: end of t∗ − 1; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Triangle counter λ (= ∆GS): 3
  46. 46. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Action: weighted increment of λ by the number of triangles closed by (2, 5), with weight (t∗ − 1)(t∗ − 2)/(M(M − 1)); Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Triangle counter λ (= ∆GS): 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1))
  47. 47. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Action: weighted increment of λ by the number of triangles closed by (2, 5), with weight (t∗ − 1)(t∗ − 2)/(M(M − 1)); Coin bias: M/t∗; Coin flip outcome: tails; Actions: Remove an edge in GS chosen at random (e.g., (0, 1)); Add (2, 5) to GS; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Triangle counter λ (= ∆GS): 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1))
  51. 51. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Action: weighted increment of λ by the number of triangles closed by (2, 4), with weight t∗(t∗ − 1)/(M(M − 1)); Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Triangle counter λ (= ∆GS): 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1)) + 2t∗(t∗ − 1)/(M(M − 1))
  52. 52. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Action: weighted increment of λ by the number of triangles closed by (2, 4), with weight t∗(t∗ − 1)/(M(M − 1)); Coin bias: M/(t∗ + 1); Coin flip outcome: heads; Actions: Do nothing; Graph GS = (VS, S): [figure: sample graph on vertices 0 to 5]; Triangle counter λ (= ∆GS): 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1)) + 2t∗(t∗ − 1)/(M(M − 1))
  54. 54. How does TRIÈST-IMPR estimate the number of triangles? TRIÈST-IMPR returns λ as the unbiased estimate of ∆G(t). Corollary: The probability that a triangle of G(t) is “seen”, and causes an increment of λ, at the time t when the third edge of the triangle is on the stream is $\rho_t = \binom{t-3}{M-2} / \binom{t-1}{M} = \frac{M(M-1)}{(t-2)(t-1)}$. Since ρt > πt, TRIÈST-IMPR’s estimates have lower variance than TRIÈST-BASE’s.
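The two changes behind TRIÈST-IMPR (never decrement the counter, and update it before the sampling decision) can be sketched as follows. This is a hedged illustration rather than the authors' code: the class and field names are invented here, only the global counter λ is tracked, the stream is assumed to contain no duplicate edges, and M ≥ 2 is assumed. The weight is 1/ρt, clamped to at least 1 so that it equals 1 while every edge is still in the sample; that clamp is taken from the max{1, ·} term in the variance bound rather than from the example slides.

```python
import random
from collections import defaultdict


class TriestImpr:
    """Hypothetical sketch of TRIÈST-IMPR (insertion-only, global count only)."""

    def __init__(self, memory, seed=None):
        self.M = memory
        self.t = 0
        self.edges = []               # the reservoir S
        self.adj = defaultdict(set)   # adjacency sets of G_S
        self.counter = 0.0            # lambda: the estimate itself
        self.rng = random.Random(seed)

    def insert(self, u, v):
        self.t += 1
        t, M = self.t, self.M
        # 1) Update the counter first, using the sample as it was before
        #    this edge arrived; the counter is never decreased.
        weight = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        self.counter += weight * len(self.adj[u] & self.adj[v])
        # 2) Only then run the reservoir-sampling step.
        if t > M:
            if self.rng.random() >= M / t:
                return                          # "heads": edge is not stored
            i = self.rng.randrange(len(self.edges))
            a, b = self.edges[i]
            self.edges[i] = self.edges[-1]
            self.edges.pop()
            self.adj[a].discard(b)
            self.adj[b].discard(a)
        self.edges.append((u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)
```

Because removals from S do not touch `counter`, the returned value is non-decreasing over time, matching the monotonicity of the true triangle count on insertion-only streams.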
  55. 55. Where are the theorems? The order of the updates on the stream affects the probability of “seeing” a triangle; This further complicates the analysis of the variance: Theorem (Upper bound to the variance): For any time t > M, we have $\mathrm{Var}\left[\tau^{(t)}\right] \le |\Delta^{(t)}| \left( \max\left\{1, \frac{(t-1)(t-2)}{M(M-1)}\right\} - 1 \right) + z^{(t)} \frac{t-1-M}{M}$. We proceed case by case: unintuitive, tedious, pessimistic, inelegant, and loose.
  56. 56. What about fully-dynamic edge streams? Handling deletions is hard; TRIÈST-FD’s approach is inspired by random pairing (Gemulla et al., 2009). TRIÈST-FD tracks all deletions and updates S by removing deleted edges; This is not enough: the resulting S is no longer a uniform sample of the non-deleted edges in G(t); TRIÈST-FD keeps track of the maximum number of edges at any time t; This makes it possible to compute the bias of the current S due to unpaired deletions; TRIÈST-FD weights ∆S by the bias to obtain the estimate for ∆G(t).
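To make the idea of "pairing" deletions with later insertions concrete, here is a minimal sketch of the generic random pairing sampling scheme that the slide cites as inspiration. It is not TRIÈST-FD itself: the triangle counters and the bias-correcting weight are deliberately left out, and the class name and the counters `d_in` and `d_out` are names chosen here for illustration.

```python
import random


class RandomPairingSample:
    """Bounded-size sample under insertions and deletions (random pairing sketch)."""

    def __init__(self, memory, seed=None):
        self.M = memory
        self.sample = []
        self.pos = {}          # item -> index in self.sample, for O(1) removal
        self.population = 0    # number of live (non-deleted) items
        self.d_in = 0          # unpaired deletions that removed a sampled item
        self.d_out = 0         # unpaired deletions of items outside the sample
        self.rng = random.Random(seed)

    def _add(self, item):
        self.pos[item] = len(self.sample)
        self.sample.append(item)

    def _remove(self, item):
        i = self.pos.pop(item)
        last = self.sample.pop()
        if i < len(self.sample):      # item was not the last element
            self.sample[i] = last
            self.pos[last] = i

    def insert(self, item):
        self.population += 1
        if self.d_in + self.d_out == 0:
            # No unpaired deletions: plain reservoir sampling.
            if len(self.sample) < self.M:
                self._add(item)
            elif self.rng.random() < self.M / self.population:
                self._remove(self.rng.choice(self.sample))
                self._add(item)
        elif self.rng.random() < self.d_in / (self.d_in + self.d_out):
            self._add(item)           # pair with a deletion that hit the sample
            self.d_in -= 1
        else:
            self.d_out -= 1           # pair with a deletion that missed the sample

    def delete(self, item):
        self.population -= 1
        if item in self.pos:
            self._remove(item)
            self.d_in += 1
        else:
            self.d_out += 1
```

In a triangle-counting setting the items would be edges, and the sample-graph counters would be updated inside `_add` and `_remove`; how the resulting count is then reweighted is specific to TRIÈST-FD and is not reproduced here.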
  57. 57. Where are the experiments? Implementation: C++, available from http://bit.ly/triestkdd; Graphs: Last.fm, Patent-Cit, Patent-Coaut, Twitter, Yahoo!, and others; Goals: evaluate variance, runtime, and scalability; Environment: Brown CS computing cluster (single core, max 4 GB RAM).
  58. 58. How does TRIÈST-IMPR perform? Yahoo! graph with 1.2 billion edges (computing the exact ∆G is infeasible); Space M = 1 million (less than 0.1% of the graph); [Plot: global triangle count vs. time t; series: max est., min est., avg est.] Takeaway: The unbiased estimates are highly concentrated around the mean.
  59. 59. How does TRIÈST-IMPR perform compared to other methods? Last.fm graph (40 million edges, 1 billion triangles); Space M = 100K (0.25% of the graph); Compared with MASCOT (KDD’15), which uses edge sampling with fixed probability; [Plot 1: global triangle count vs. time t; series: ground truth, max and min est. for TRIÈST-IMPR, max and min est. for MASCOT-I] [Plot 2: std. dev. of the estimation vs. time t; series: TRIÈST-IMPR, MASCOT-I] Takeaway: TRIÈST has much more accurate estimations with lower variance.
  60. 60. How does TRIÈST-FD perform? [Plots: global triangle count vs. time t; series: ground truth, avg est., avg est. + std dev, avg est. − std dev; panels: (c) Patent (Cit.), (d) LastFm, (e) Yahoo! Answers (no ground-truth series)] Takeaway: 1) The estimations are very accurate; 2) TRIÈST makes it possible to study the evolution of triangles at a level not available before; e.g., it is possible to detect patterns and anomalies.
  61. 61. How scalable is TRIÈST-FD? We measured the average time to handle an update on the stream; [Plot: avg. microseconds per update for patent-cit, patent-coaut, lastfm, yahoo; series: M = 200000, M = 500000, M = 1000000] Takeaway: between 2 µs/edge and 3 ms/edge (i.e., between 500k edges/sec. and roughly 300 edges/sec.).
  62. 62. What didn’t I tell you? The Goods: Concentration results (the one for TRIÈST-BASE is very elegant); Theorems for TRIÈST-FD; TRIÈST for multigraphs (various definitions of triangle counts); Many more experiments and comparisons with the state of the art; The Bads: Results on variance are upper bounds, often loose; Some of the concentration bounds are quite naïve (Chebyshev inequality); The bounds should not depend on the order of the edges on the stream; The Betters: We are exploring the use of cube sampling and balanced sampling to solve these issues.
  63. 63. What did I talk about? TRIÈST: three algorithms for estimating triangle counts in fully-dynamic edge streams; • Uses a fixed, constant amount of memory; • Is intrinsically incremental; • Scales to billion-edge graphs and handles tens of thousands of edges per second; • Uses reservoir sampling in a smart way; • Gives unbiased, low-variance, highly-concentrated estimates; Complex analysis due to non-independent sampling, but worth the effort! Thank you! EML: matteo@twosigma.com TWTR: @teorionda WWW: http://matteo.rionda.to
  64. 64. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
