The document discusses methods for counting triangles (three-node fully connected subgraphs) in large graphs in a distributed manner. It begins by explaining why triangle counting is important for calculating clustering coefficients and analyzing network cohesion. It then describes a naive MapReduce approach that parallelizes across edges but suffers from significant performance problems due to data skew from high degree nodes. The document proposes two approaches to address this problem: (1) counting each triangle from the perspective of the lowest degree node, and (2) using an overlapping divide-and-conquer approach to partition the graph while accounting for triangles that span partitions.
Presented as a Tutorial at the 2023 Knowledge Graph Conference, this deck explores different ways that information can be transformed across knowledge portals, from basic RDF structures to the use of SPARQL UPDATE based Workflows. It then explores how ChatGPT can be used to expand upon this transformation capability, and why knowledge portals should be considered transformation engines for graphs.
Presented as a Tutorial at the 2023 Knowledge Graph Conference, this deck explores different ways that information can be transformed across knowledge portals, from basic RDF structures to the use of SPARQL UPDATE based Workflows. It then explores how ChatGPT can be used to expand upon this transformation capability, and why knowledge portals should be considered transformation engines for graphs.
These presentation was given to Electronics and Communication Students, Sem6, CENTRAL INSTITUTE OF TECHNOLOGY(CIT) KOKRAJHAR. This presentation includes general overview of VLSI industry and career guidance for VLSI industry. The program was initiated by Dr. Agile Mathew, Assistant Professor at CIT.
How to add a yellow circle around mouse cursor? - ThiyaguThiyagu K
This presentation explains the procedure of how to get or add a yellow circle around our mouse cursor or pointer. PenAttention is a free Windows program that displays a highlight, pencil, or pointer at the location of the pen. It's intended for use in presentations on a pen-enabled laptop or PC so our audience can see what we are pointing at on the screen. It is a self explanatory tutorial..
Webinar Presentation: Diagnostic Flash Application with OTXKPIT
This presentation was created for a webinar on "Diagnostic Flash Application with OTX". Flash programming is a use case which In2Soft Diagnostic Tools support very well. Below are the list of use cases that are addressed through the webinar and elucidated using the presentation:
• Designing flash screen with OTX
• Performing diagnostic operations
• Creating variables
• Using loops & branches
• Executing & debugging the application
• Creating sophisticated HTML reports
• Publishing the sequence for diagnostic engineers & testers
These presentation was given to Electronics and Communication Students, Sem6, CENTRAL INSTITUTE OF TECHNOLOGY(CIT) KOKRAJHAR. This presentation includes general overview of VLSI industry and career guidance for VLSI industry. The program was initiated by Dr. Agile Mathew, Assistant Professor at CIT.
How to add a yellow circle around mouse cursor? - ThiyaguThiyagu K
This presentation explains the procedure of how to get or add a yellow circle around our mouse cursor or pointer. PenAttention is a free Windows program that displays a highlight, pencil, or pointer at the location of the pen. It's intended for use in presentations on a pen-enabled laptop or PC so our audience can see what we are pointing at on the screen. It is a self explanatory tutorial..
Webinar Presentation: Diagnostic Flash Application with OTXKPIT
This presentation was created for a webinar on "Diagnostic Flash Application with OTX". Flash programming is a use case which In2Soft Diagnostic Tools support very well. Below are the list of use cases that are addressed through the webinar and elucidated using the presentation:
• Designing flash screen with OTX
• Performing diagnostic operations
• Creating variables
• Using loops & branches
• Executing & debugging the application
• Creating sophisticated HTML reports
• Publishing the sequence for diagnostic engineers & testers
A Very Brief Introduction to Reflections in 2D Geometric Algebra, and their U...James Smith
This document discusses reflections of vectors and of geometrical products of two vectors, in two-dimensional Geometric Algebra (GA). It then uses reflections to solve a simple tangency problem.
A very quick and easy to understand introduction to Gram-Schmidt Orthogonalization (Orthonormalization) and how to obtain QR decomposition of a matrix using it.
A graph G consists of a non empty set V called the set of nodes (points, vertices) of the graph, a set E, which is the set of edges of the graph and a mapping from the set of edges E to a pair of elements of V.
Any two nodes, which are connected by an edge in a graph are called "adjacent nodes".
In a graph G(V,E) an edge which is directed from one node to another is called a "directed edge", while an edge which has no specific direction is called an "undirected edge". A graph in which every edge is directed is called a "directed graph" or a "digraph". A graph in which every edge is undirected is called an "undirected graph".
If some of edges are directed and some are undirected in a graph then the graph is called a "mixed graph".
Any graph which contains some parallel edges is called a "multigraph".
If there is no more than one edge but a pair of nodes then, such a graph is called "simple graph."
A graph in which weights are assigned to every edge is called a "weighted graph".
In a graph, a node which is not adjacent to any other node is called "isolated node".
A graph containing only isolated nodes is called a "null graph". In a directed graph for any node v the number of edges which have v as initial node is called the "outdegree" of the node v. The number of edges to have v as their terminal node is called the "Indegree" of v and Sum of outdegree and indegree of a node v is called its total degree.
In this playlist
https://youtube.com/playlist?list=PLT...
I'll illustrate algorithms and data structures course, and implement the data structures using java programming language.
the playlist language is arabic.
The Topics:
--------------------
1- Arrays
2- Linear and Binary search
3- Linked List
4- Recursion
5- Algorithm analysis
6- Stack
7- Queue
8- Binary search tree
9- Selection sort
10- Insertion sort
11- Bubble sort
12- merge sort
13- Quick sort
14- Graphs
15- Hash table
16- Binary Heaps
Reference : Object-Oriented Data Structures Using Java - Third Edition by NELL DALE, DANEIEL T.JOYCE and CHIP WEIMS
Slides is owned by College of Computing & Information Technology
King Abdulaziz University, So thanks alot for these great materials
3. Why Count Triangles?
Clustering Coefficient:
Given an undirected graph G = (V, E)
cc(v) = fraction of v’s neighbors who are neighbors themselves
|{(u, w) ∈ E|u ∈ Γ(v) ∧ w ∈ Γ(v)}|
= dv
2
WWW 2011 3 Sergei Vassilvitskii
4. Why Count Triangles?
Clustering Coefficient:
Given an undirected graph G = (V, E)
cc(v) = fraction of v’s neighbors who are neighbors themselves
|{(u, w) ∈ E|u ∈ Γ(v) ∧ w ∈ Γ(v)}|
= dv
2
cc ( ) = N/A
cc ( ) = 1/3
cc ( )=1
cc ( )=1
WWW 2011 4 Sergei Vassilvitskii
5. Why Count Triangles?
Clustering Coefficient:
Given an undirected graph G = (V, E)
cc(v) = fraction of v’s neighbors who are neighbors themselves
|{(u, w) ∈ E|u ∈ Γ(v) ∧ w ∈ Γ(v)}| #∆s incident on v
= dv = dv
2 2
cc ( ) = N/A
cc ( ) = 1/3
cc ( )=1
cc ( )=1
WWW 2011 5 Sergei Vassilvitskii
6. Why Clustering Coefficient?
Captures how tight-knit the network is around a node.
cc ( ) = 0.5 cc ( ) = 0.1
vs.
WWW 2011 6 Sergei Vassilvitskii
7. Why Clustering Coefficient?
Captures how tight-knit the network is around a node.
cc ( ) = 0.5 cc ( ) = 0.1
vs.
Network Cohesion:
- Tightly knit communities foster more trust, social norms. [Coleman
’88, Portes ’88]
Structural Holes:
- Individuals benefit form bridging [Burt ’04, ’07]
WWW 2011 7 Sergei Vassilvitskii
8. Why MapReduce?
De facto standard for parallel computation on
large data
– Widely used at: Yahoo!, Google, Facebook,
– Also at: New York Times, Amazon.com, Match.com, ...
– Commodity hardware
– Reliable infrastructure
– Data continues to outpace available RAM !
WWW 2011 8 Sergei Vassilvitskii
9. How to Count Triangles
Sequential Version:
foreach v in V
foreach u,w in Adjacency(v)
if (u,w) in E
Triangles[v]++
v
Triangles[v]=0
WWW 2011 9 Sergei Vassilvitskii
10. How to Count Triangles
Sequential Version:
foreach v in V
foreach u,w in Adjacency(v)
if (u,w) in E
Triangles[v]++
v
Triangles[v]=1
w
u
WWW 2011 10 Sergei Vassilvitskii
11. How to Count Triangles
Sequential Version:
foreach v in V
foreach u,w in Adjacency(v)
if (u,w) in E
Triangles[v]++
v
Triangles[v]=1
w
u
WWW 2011 11 Sergei Vassilvitskii
12. How to Count Triangles
Sequential Version:
foreach v in V
foreach u,w in Adjacency(v)
if (u,w) in E
Triangles[v]++
Running time: d2
v
v∈V
Even for sparse graphs can be quadratic if one vertex has high
degree.
WWW 2011 12 Sergei Vassilvitskii
13. Parallel Version
Parallelize the edge checking phase
WWW 2011 13 Sergei Vassilvitskii
14. Parallel Version
Parallelize the edge checking phase
– Map 1: For each v send (v, Γ(v)) to single machine.
– Reduce 1: Input: v; Γ(v)
Output: all 2 paths (v1 , v2 ); u where v1 , v2 ∈ Γ(u)
( , ); ( , ); ( , );
WWW 2011 14 Sergei Vassilvitskii
15. Parallel Version
Parallelize the edge checking phase
– Map 1: For each v send (v, Γ(v)) to single machine.
– Reduce 1: Input: v; Γ(v)
Output: all 2 paths (v1 , v2 ); u where v1 , v2 ∈ Γ(u)
( , ); ( , ); ( , );
– Map 2: Send (v1 , v2 ); u and (v1 , v2 ); $ for (v1 , v2 ) ∈ E to same
machine.
– Reduce 2: input: (v, w); u1 , u2 , . . . , uk , $?
Output: if $ part of the input, then: ui = ui + 1/3
( , ); , $ −→ +1/3 +1/3 +1/3
( , ); −→
WWW 2011 15 Sergei Vassilvitskii
16. Data skew
How much parallelization can we achieve?
- Generate all the paths to check in parallel
- The running time becomes max d2 v
v∈V
WWW 2011 16 Sergei Vassilvitskii
17. Data skew
How much parallelization can we achieve?
- Generate all the paths to check in parallel
- The running time becomes max d2 v
v∈V
Naive parallelization does not help with data skew
– Some nodes will have very high degree
– Example. 3.2 Million followers, must generate 10 Trillion (10^13)
potential edges to check.
– Even if generating 100M edges per second, 100K seconds ~ 27 hours.
WWW 2011 17 Sergei Vassilvitskii
18. “Just 5 more minutes”
Running the naive algorithm on LiveJournal Graph
– 80% of reducers done after 5 min
– 99% done after 35 min
WWW 2011 18 Sergei Vassilvitskii
19. Adapting the Algorithm
Approach 1: Dealing with skew directly
– currently every triangle counted 3 times (once per vertex)
– Running time quadratic in the degree of the vertex
– Idea: Count each once, from the perspective of lowest degree vertex
– Does this heuristic work?
WWW 2011 19 Sergei Vassilvitskii
20. Adapting the Algorithm
Approach 1: Dealing with skew directly
– currently every triangle counted 3 times (once per vertex)
– Running time quadratic in the degree of the vertex
– Idea: Count each once, from the perspective of lowest degree vertex
– Does this heuristic work?
Approach 2: Divide Conquer
– Equally divide the graph between machines
– But any edge partition will be bound to miss triangles
– Divide into overlapping subgraphs, account for the overlap
WWW 2011 20 Sergei Vassilvitskii
21. How to Count Triangles Better
Sequential Version [Schank ’07]:
foreach v in V
foreach u,w in Adjacency(v)
if deg(u) deg(v) deg(w) deg(v)
if (u,w) in E
Triangles[v]++
WWW 2011 21 Sergei Vassilvitskii
22. Does it make a difference?
WWW 2011 22 Sergei Vassilvitskii
23. Dealing with Skew
Why does it help?
– Partition nodes into two groups:
√
• Low: L = {v : dv ≤ m}
√
• High: H = {v : dv m}
– There are at most n low nodes; each produces at most O(m) paths
√
– There are at most 2 m high nodes
• Each produces paths to other high nodes: O(m) paths per node
WWW 2011 23 Sergei Vassilvitskii
24. Dealing with Skew
Why does it help?
– Partition nodes into two groups:
√
• Low: L = {v : dv ≤ m}
√
• High: H = {v : dv m}
– There are at most n low nodes; each produces at most O(m) paths
√
– There are at most 2 m high nodes
• Each produces paths to other high nodes: O(m) paths per node
– These two are identical !
– Therefore, no mapper can produce substantially more work than
others.
3
– Total work is O(m /2 ) , which is optimal
WWW 2011 24 Sergei Vassilvitskii
25. Approach 2: Graph Split
Partitioning the nodes:
- Previous algorithm shows one way to achieve better parallelization
- But what if even O(m) is too much. Is it possible to divide input into
smaller chunks?
Graph Split Algorithm:
– Partition vertices into p equal sized groups V1 , V2 , . . . , Vp .
– Consider all possible triples (Vi , Vj , Vk ) and the induced subgraph:
Gijk = G [Vi ∪ Vj ∪ Vk ]
– Compute the triangles on each Gijk separately.
WWW 2011 25 Sergei Vassilvitskii
26. Approach 2: Graph Split
Some Triangles present in multiple subgraphs:
in p-2 subgraphs
Vi Vj
in 1 subgraph
Vk
in ~p2 subgraphs
Can count exactly how many subgraphs each triangle will be in
WWW 2011 26 Sergei Vassilvitskii
27. Approach 2: Graph Split
Analysis:
– Each subgraph has O(m/p2 ) edges in expectation.
– Very balanced running times
WWW 2011 27 Sergei Vassilvitskii
28. Approach 2: Graph Split
Analysis:
– Very balanced running times
– p controls memory needed per machine
WWW 2011 28 Sergei Vassilvitskii
29. Approach 2: Graph Split
Analysis:
– Very balanced running times
– p controls memory needed per machine
3
– Total work: p · O((m/p2 )3/2 ) = O(m
3/2
) , independent of p
WWW 2011 29 Sergei Vassilvitskii
30. Approach 2: Graph Split
Analysis:
– Very balanced running times
– p controls memory needed per machine
3
– Total work: p · O((m/p2 )3/2 ) = O(m
3/2
) , independent of p
Input too big: Shuffle time
paging increases with
duplication
WWW 2011 30 Sergei Vassilvitskii
31. Overall
Naive Parallelization Doesn’t help with Data Skew
WWW 2011 31 Sergei Vassilvitskii
32. Related Work
• Tsourakakis et al. [09]:
– Count global number of triangles by estimating the trace of the cube
of the matrix
– Don’t specifically deal with skew, obtain high probability
approximations.
• Becchetti et al. [08]
– Approximate the number of triangles per node
– Use multiple passes to obtain a better and better approximation
WWW 2011 32 Sergei Vassilvitskii
34. Conclusions
Think about data skew.... and avoid the curse
WWW 2011 33 Sergei Vassilvitskii
35. Conclusions
Think about data skew.... and avoid the curse
– Get programs to run faster
WWW 2011 33 Sergei Vassilvitskii
36. Conclusions
Think about data skew.... and avoid the curse
– Get programs to run faster
– Publish more papers
WWW 2011 33 Sergei Vassilvitskii
37. Conclusions
Think about data skew.... and avoid the curse
– Get programs to run faster
– Publish more papers
– Get more sleep
WWW 2011 33 Sergei Vassilvitskii
38. Conclusions
Think about data skew.... and avoid the curse
– Get programs to run faster
– Publish more papers
– Get more sleep
– ..
WWW 2011 33 Sergei Vassilvitskii
39. Conclusions
Think about data skew.... and avoid the curse
– Get programs to run faster
– Publish more papers
– Get more sleep
– ..
– The possibilities are endless!
WWW 2011 33 Sergei Vassilvitskii